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1.  SUMMARY 


This  report  summarizes  our  effort  towards  building  a  robust  open-set  speaker  recognition 
system. 

In  Section  2,  we  introduce  the  structure  of  speaker  recognition  systems,  the  databases  we 
have  used  in  our  experiments,  and  the  evaluation  criteria  we  used  to  evaluate  the  results. 

In  Section  3,  we  review  the  various  techniques  we  have  used  for  acoustic  feature 
extraction.  First  we  review  the  acoustic  features  and  pre-/post-processing  techniques,  and 
propose  a  PPMD  feature  that  combines  their  advantages.  Next,  we  elaborate  on  a  number  of 
speaker  modeling  and  scoring  methods,  including  GMM  modeling,  SVM  modeling,  and  joint 
factor  analysis  (JFA).  We  also  introduce  some  score  normalization  techniques. 

In  Section  4  we  present  experiment  results.  The  experiments  compare  the  performances 
of  various  acoustic  features  and  pre-/post-processing  techniques,  and  show  that  PPMD 
achieves  a  better  performance  than  other  features.  The  experiments  also  show  that  SVM 
speaker  modeling  is  both  better  and  faster  than  GMM  modeling,  and  T-norm  of  the  scores 
improves  open-set  speaker  identification  performance  on  the  ROSSI  database. 

The  report  is  concluded  in  Section  5. 

2.  INTRODUCTION 

2.1  Block  Diagram  of  Speaker  Recognition  Systems 


SVM-negative  SVM-negative  SVM-negative 

Utterances  Features  GMMs 


Figure  1.  Block  Diagram  of  a  Typical  Speaker  Recognition  System 
(Dotted  Blocks  and  Arrows  Denote  Modules  Special  to  SVM  Speaker  Modeling) 
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Figure  1  is  the  block  diagram  of  a  typical  speaker  recognition  system.  A  complete  run  of 
the  system  consists  of  the  following  steps: 

(0)  Voice  activity  detection  (VAD).  This  is  a  preliminary  step  in  which  the 
speech-containing  regions  are  located  in  the  utterance.  Usually  people  don’t  spend  too  much 
effort  here;  they  usually  just  apply  an  energy-based  threshold  on  the  frames.  We  determine  the 
threshold  by  clustering  the  log-energy  values  of  the  frames  in  an  utterance  into  2  clusters 
using  the  k-means  algorithm. 

(1)  Feature  extraction.  This  turns  the  speech-containing  regions  into  sequences  of 
feature  vectors.  We  have  investigated  lots  of  features  (MFCC,  MHEC,  WMVDR,  Multitaper 
MFCC,  SCF  &  SCM,  HSCC,  FFV,  DSCC)  and  pre-/post-processing  techniques 
(pre-emphasis,  LP,  CMN,  RASTA,  Gaussianization,  delta  features).  Feature  fusion  also  goes 
here. 

(2)  UBM  training.  A  universal  background  model  (UBM)  is  trained  from  some 
background  utterances.  The  UBM  will  be  useful  in  the  MAP  training  during  speaker 
enrollment,  and  may  also  act  as  a  reference  in  scoring. 

(3)  Speaker  enrollment  and  modeling.  The  traditional  method  of  modeling  target 
speakers  is  Gaussian  mixture  modeling.  A  Gaussian  mixture  model  (GMM)  is  either  trained 
independently  or  adapted  from  the  UBM  for  each  target  speaker.  Common  adaptation 
techniques  include  relevance  MAP  adaptation  and  joint  factor  analysis  (JFA).  It  is  also 
possible  to  model  the  speakers  discriminatively  using  support  vector  machines  (SVMs),  in 
which  case  a  SVM  is  trained  for  each  speaker  using  his/her  own  GMM  as  a  positive  example 
and  the  GMMs  of  many  other  speakers  as  negative  examples. 

(4)  Scoring.  A  score  is  given  for  each  trial  utterance  u  against  the  model  of  each 
candidate  speaker  s  for  this  utterance  (or  the  UBM).  The  scoring  method  and  its  inputs 
depend  on  the  speaker  modeling  technique  used.  For  GMM  speaker  modeling,  the  score  is 
usually  calculated  from  the  target  GMM  and  trial  utterance  features.  This  is  called 
“frame-by-frame  scoring”,  and  it  can  be  approximated  in  various  ways  for  faster  speed.  For 
SVM  speaker  modeling,  the  score  is  calculated  from  the  target  SVM  and  a  GMM  trained 
from  the  trial  utterance. 

(5)  Score  fusion  and/or  normalization.  Nonnalization  techniques  include  Z-nonn, 
T-norm,  ZT-nonn  and  TZ-norm. 

(6)  Decision.  For  speaker  verification,  the  score  of  each  trial  utterance  against  its 
hypothesized  speaker  is  compared  against  a  global  threshold.  For  closed-set  speaker 
identification,  the  target  speaker  scoring  the  highest  is  picked  for  each  trial  utterance.  For 
open-set  speaker  identification,  the  highest-scoring  target  speaker  is  picked,  and  its  score 
compared  against  a  global  threshold  to  decide  whether  the  utterance  comes  from  the  target 
speaker  or  a  non-target  speaker. 

2.2  Databases  Used  in  the  IROSIS  Project 

The  primary  database  used  in  this  project  is  the  ROSSI  database.  This  is  a  database  for 
open-set  speaker  identification,  i.e.  one  has  to  decide  both  whether  the  trial  speaker  is  in  the 
target  set  and  who  the  speaker  is.  This  database  comprises  of  8  sets,  differing  in  language, 
channel,  environment,  etc.  Each  set  has  100  training  utterances  (1  for  each  target  speaker)  and 


Approved  for  Public  Release;  Distribution  Unlimited. 


2 


200  trial  utterances  (1  in-set  trial  utterance  for  each  target  speaker  and  100  out-of-set  trial 
utterances).  In  addition  to  these,  Set  1  has  196  development  utterances,  and  Sets  2~8  share 
400  development  utterances.  The  development  utterances  are  used  for  UBM  training,  SVM 
training  and  score  normalization. 

For  some  time  in  the  spring  of  201 1,  we  also  used  the  MIXER5  database.  This  database 
contains  speech  from  60  male  speakers  and  81  female  speakers.  Each  speaker  speaks  in  6 
sessions  (labeled  U  ~  Z)  of  half  an  hour  each,  and  the  speech  is  recorded  in  14  channels 
(labeled  A  ~  N).  We  selected  the  data  to  make  up  four  scenarios  that  are  similar  to  the  sets  in 
the  ROSSI  database.  We  used  Session  V  for  training  and  Session  Y  for  testing,  and 
mix -matched  the  channels  B  (close-talking)  and  L  (far- field).  The  four  scenarios  are  referred 
to  as  VB-YB,  VL-YL,  VB-YL  and  VL-YB  respectively. 


Table  1.  Detail  of  NIST  Data  Used  for  Training  and  Testing 


Purpose 

Source 

No. 

Speakers 

No. 

Utterances 

Duration 
(hr,  voiced  part 
only) 

Training  UBM 

2004,  2005,  2006 
single  train 

692 

867 

29 

Training  V 

2005,  2006  multi  train 

435 

2908 

98 

Training  D 

2004  multi  train 

125 

1382 

45 

Training  U 

2005,2006  TEL  test 

- 

2672 

90 

Negative 
examples  for 
SVM 

2004  single  train 

125 

246 

8 

Enrollment 

2008  short  train 

506 

648 

20 

Test 

2008  short  test 

279 

620 

19 

As  JFA  speaker  modeling  requires  much  more  data  than  the  ROSSI  database  could 
provide,  we  also  used  the  data  from  the  NIST  SRE  evaluations  in  2004,  2005,  2006,  and  2008. 
The  NIST  database  is  for  speaker  verification,  i.e.  one  only  needs  to  decide  whether  the  trial 
speaker  is  the  hypothesized  target  speaker  or  not.  Our  experiments  have  focused  on  the 
telephone  utterances  of  male  speakers.  Table  1  shows  how  we  chose  the  data  for  training  and 
testing. 

2.3  Evaluation  Criteria 

In  an  open-set  SID  system,  the  ground  truth  and  the  identification  result  can  be  classified 
into  five  cases: 

•  Correct  accept  (CA)  -  the  test  speaker  is  a  target  speaker,  and  the  system  identifies  it 
correctly. 

•  Speaker  confusion  error  (SE)  -  the  test  speaker  is  a  target  speaker,  but  the  system 
misclassifies  it  as  another  target  speaker. 
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•  False  reject  (FR,  also  called  “miss”)  -  the  test  speaker  is  a  target  speaker,  but  the 
system  classifies  it  as  an  imposter. 

•  Correct  reject  (CR)  -  the  test  speaker  is  an  imposter,  and  the  system  correctly 
classifies  it  as  an  imposter. 

•  False  accept  (FA,  also  called  “false  alarm”)  -  the  test  speaker  is  an  imposter,  but  the 
system  accepts  it  as  a  target  speaker. 

Evaluation  criteria  are  calculated  from  the  count  of  the  five  types  of  results.  We  used  the 
following  for  criteria  to  evaluate  our  open-set  SID  experiments  on  the  ROSSI  and  MIXER5 
databases: 


•  Closed-set  accuracy:  AccuClosed  = 


CA 

CA  +  SE 


with  the  threshold  9  set  to  minus 


infinity  (in  which  case  nothing  is  rejected,  and  FR  =  0). 

•  Open-set  accuracy:  The  harmonic  mean  of  the  correct  identification  rate  for  I-  and 
O-trials,  with  the  threshold  9  set  to  the  value  that  maximizes  the  accuracy,  which  is 
found  by  parameter  sweeping. 


AccuOpen  =  max 


r'  +  o 


1  = 


CA 


CA  +  SE  +  FR 


0  = 


CR 


CR  +  FA 


(1) 


•  Equal  error  rate  (EER):  This  criterion  only  evaluates  the  performance  of  in/out 
classification,  and  considers  speaker  confusion  errors  as  correct  classifications.  The 
EER  is  the  classification  error  rate  when  the  threshold  9  is  set  so  that  the  false  reject 


rate  and  false  accept  rate  are  identical: 

EER  =  -  CA  +  SE 


Set  6  so  that 


CR 


CA  +  SE  +  FR  CR  +  FA 

In  order  that  all  the  criteria  are  the  bigger  the  better,  we  report  1 


(2) 


EER  instead. 


•  Precision  biased  correct  decision  rate  (Eo.s):  This  criterion  also  only  considers  the 
in/out  classification,  but  is  focused  on  the  precision  and  recall  of  target  trials.  It  is 
defined  as  a  weighted  harmonic  mean  of  the  precision  and  recall.  Like  the  open-set 


accuracy,  the  threshold  9  is  set  to  maximize  the  criteria. 


n  .  .  CA  +  SE 

Precision  = - , 

CA  +  SE  +  FA 


Recall  = 


CA  +  SE 
CA  +  SE  +  FR 


(3) 


(1  +  /?“)•  Precision  -Recall 

max - - - , 

e  •  Precision  +  Recall 


/?  =  0.5 


(4) 


Closed-set  accuracy  evaluates  only  the  “identification”  performance  and  doesn’t 
consider  the  in-set  /  out-of-set  decision;  on  the  other  hand,  EER  and  F0.5  only  evaluate  the 
in/out  decision  performance.  Open-set  accuracy  is  a  good  overall  measure  of  the 
perfonnance. 

In  a  speaker  verification  task,  there  are  no  speaker  confusion  errors  (SE).  We  use  the 
equal  error  rate  (EER)  as  the  criterion  to  evaluate  our  speaker  verification  experiments  on  the 
NIST  database. 
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3.  METHODS,  ASSUMPTIONS,  AND  PROCEDURES 


3.1  Acoustic  Feature  Extraction 

3.1.1  Various  Acoustic  Features 

(0)  Baseline  MFCC  feature.  The  procedure  of  extracting  the  baseline  MFCC  feature  is 
shown  in  Figure  2.  Based  on  some  pilot  experiments  on  the  MIXER5  experiment,  we  chose  to 
use  CMN  (cepstral  mean  nonnalization)  as  the  normalization  technique  (in  which  we 
normalize  both  the  mean  and  the  variance  of  the  features).  We  didn’t  use  RASTA  filtering  [  1  ] 
or  spectral  subtraction  for  noise  reduction  as  they  didn’t  contribute  to  the  performance. 


Speech 


Figure  2.  Procedure  of  Extracting  the  Baseline  MFCC  Feature 


(1)  MHEC  (mean  Hilbert  envelop  coefficients)  [2].  The  procedure  of  extracting  the 
MHEC  feature  is  shown  in  Figure  3.  It  is  claimed  to  be  magically  robust  to  car  noise. 
However,  the  extraction  procedure  is  not  very  different  from  MFCC:  the  Gammatone 
interbank  is  similar  to  the  Mel  interbank  although  the  fonner  is  in  the  time  domain  and  the 
latter  is  in  the  frequency  domain;  the  calculation  of  the  Hilbert  envelope  is  similar  to  that  of 
the  energy  in  each  time-frequency  unit;  and  the  long-term  average  is  similar  to  CMS  (cepstral 
mean  subtraction).  There  seems  to  be  no  theoretical  foundation  why  MHEC  should  be  robust 
to  car  noise,  and  our  experiments  didn’t  support  this  claim,  either. 


*(">(?)  e(n,Q)  E(I,Q)  £„(/,£?) 


Figure  3.  Procedure  of  Extracting  the  MHEC  Feature 


(2)  WMVDR  (warped  minimum  variance  distortionless  response)  [3].  MVDR  is  a 
method  for  estimating  a  smoothed  version  of  a  signal’s  power  spectrum.  Its  warped  version, 
warped  MVDR  (WMVDR),  is  used  to  replace  the  FFT  and  filterbank  steps  in  the  extraction 
of  MFCC,  producing  a  smoother  power  spectrum  with  different  resolution  at  different 
frequencies  just  like  the  output  of  the  Mel  filterbank.  There  had  been  experiments  [3] 
showing  that  WMVDR  features  perfonn  better  than  MFCC  features.  But  our  experiments 
have  shown  that  the  performance  of  WMVDR  features  wasn’t  very  different  from  MFCC. 
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This  means  that  a  smoother  power  spectrum  doesn’t  have  much  effect  on  the  final 
identification  performance. 

(3)  Multitaper  MFCC  [4].  The  motivation  of  multitaper  MFCC  is  to  reduce  the  variance 
in  the  power  spectrum  estimation.  It  estimates  the  power  spectrum  of  a  frame  using  a  series  of 
different  windows  (also  called  “tapers”),  and  then  average  the  different  estimations  of  the 
power  spectrum.  The  operation  of  averaging  reduces  the  variance  in  the  power  spectrum  at 
the  cost  of  reduced  resolution,  but  resolution  is  not  important  because  the  power  spectrum 
will  be  fed  into  the  Mel  filterbank  anyway.  The  three  types  of  multitapers  introduced  in  [4]  all 
resulted  in  significant  improvement  in  closed-set  accuracy  on  the  ROSSI  database,  especially 
the  multipeak  series  with  8  windows. 

(4)  SCF  &  SCM  (spectrum  centroid  frequency  &  magnitude)  [5],  In  the  computation 
of  MFCC,  the  energy  distribution  within  each  band  is  ignored.  SCF  aims  to  incorporate  this 
infonnation  in  the  feature  as  well.  Defined  as  the  average  of  the  frequency  points  on  the 
spectrum  within  a  band  weighted  by  the  magnitude  at  these  frequency  points,  SCF  describes 
how  much  the  energy  distribution  deviates  from  the  center  frequency  in  each  band.  SCM  is 
defined  as  the  average  of  the  magnitudes  within  a  band  weighted  by  frequency,  and  can  be 
computed  efficiently  together  with  SCF.  Although  there’s  no  reason  to  weight  the  magnitude 
by  frequency,  because  the  bands  are  narrow,  SCM  is  actually  almost  equal  to  the  MFCC. 

Since  SCF  contains  information  that  is  absent  in  MFCC,  a  fusion  of  the  two  features  is 
expected  to  improve  the  perfonnance.  However,  the  perfonnance  of  SCF  alone  turned  out  so 
poor  that  neither  feature  fusion  nor  score  fusion  performed  as  well  as  MFCC. 

(5)  HSCC  &  FFV  (harmonic  structure  cepstral  coefficients  &  fundamental 
frequency  variation)  [6]  [7].  These  are  two  prosodic  features  that  provide  complementary 
infonnation  to  MFCC.  HSCC  models  the  probabilistic  distribution  of  the  fundamental 
frequency  by  hannonic  summation,  and  FFV  models  the  rate  of  variation  of  the  fundamental 
frequency  by  stretching  and  comparing  the  short-time  spectrum  of  adjacent  frames.  We 
earned  out  score  fusion  of  MFCC,  HSCC  and  FFV  on  the  MIXER5  database.  The  fusion 
resulted  in  marginal  improvement  in  closed-set  accuracy,  but  the  fusion  weights  were  quite 
counter-intuitive  (FFV  often  carries  a  larger  weight  than  MFCC). 

(6)  DSCC  (delta-spectral  cepstral  coefficients)  [8].  This  feature  aims  at  enhancing  the 
noise  robustness  of  delta  MFCC  features.  Its  extraction  procedure  is  compared  with  that  of 
MFCC  in  Figure  4.  In  extracting  delta  MFCC,  the  differential  is  performed  last,  long  after  the 
logarithm.  Additive  noise  can  have  a  destructive  effect  on  the  log  spectrum  filling  in  all  the 
deep  valleys,  therefore  the  differential  of  the  clean  log  spectrum  and  noisy  log  spectrum  will 
be  very  different.  But  on  the  spectrum  before  logarithm,  the  effect  of  additive  noise  is 
relatively  small.  DSCC  calculates  the  differential  before  applying  the  logarithmic 
non-linearity,  and  since  the  differential  can  have  negative  values,  the  logarithm  is  replaced  by 
Gaussianization.  Our  experiments  have  shown  that  DSCC  is  more  robust  to  noise  than  delta 
MFCC. 


Approved  for  Public  Release;  Distribution  Unlimited. 


6 


Speech 


Speech 


Figure  4.  Comparison  of  the  Extraction  of  the  Delta  MFCC  and  DSCC  Features 


3.1.2  Pre-  and  Post-Processing  Techniques 

(1)  Pre-emphasis  and  delta  features.  These  are  traditional  pre-  and  post-processing 
techniques.  We  have  verified  that  they  do  improve  the  performance  on  the  ROSSI  database, 
both  closed-set  and  open-set. 

(2)  Linear  prediction  (LP)  pre-processing  for  noise  reduction.  Linear  prediction  can 
be  used  for  speech  coding,  because  speech  signals  have  patterns  and  are  largely  predictable. 
On  the  other  hand,  noise  is  random  and  unpredictable.  This  inspired  us  to  think  that 
pre-processing  a  noisy  signal  with  linear  prediction  can  increase  the  signal-to-noise  ratio 
(SNR)  of  the  signal.  We  applied  LP  with  block  filtering  up  to  20th  order;  later  we  discovered 
some  related  work  [9]  that  used  least-mean-squares  (LMS)  filtering  up  to  several  hundredth 
order.  Most  combinations  of  filtering  method  and  filter  order  were  confirmed  to  increase  the 
SNR,  and  the  typical  SNR  gain  is  between  5  —  10  dB.  However,  the  increase  in  SNR  didn’t 
translate  directly  into  improved  performance  of  the  back-end  system.  We  experimented  with 
three  back-end  systems  (speaker  recognition  on  the  MIXER5  database,  continuous  spoken 
digits  recognition,  and  HMM-based  pitch  tracking),  and  the  performance  of  all  the  three 
systems  stayed  the  same  no  matter  we  applied  LP  pre-processing  or  not. 

(3)  Short-time  feature  Gaussianization  [10],  This  is  an  advanced  version  of  CMN. 
Gaussianization  not  only  nonnalizes  the  mean  and  variance  of  the  features,  but  warps  them 
non-linearly  to  conform  to  the  Gaussian  density  function.  In  short-time  feature 
Gaussianization,  a  feature  value  that  is  the  k- th  smallest  in  an  /V-point  window  around  it  will 

be  warped  to  O  \(k  -  1)/  N\,  where  O  is  the  Gaussian  cumulative  density  function  (CDF). 

Short-time  feature  Gaussianization  has  shown  consistent  performance  improvement  across 
most  conditions,  although  there  hasn’t  been  any  determinative  explanation  as  to  why.  A 
plausible  explanation  is  that  a  Gaussian  distribution  of  the  features  agrees  best  with  the  GMM 
speaker  modeling. 

3.1.3  Best  Feature  Selected 

Our  experiments  with  the  various  types  of  acoustic  features  and  pre-/post-processing 
techniques  (see  Section  4. 1)  showed  the  following  to  be  beneficial:  Multitaper  MFCC  (with  8 
multipeak  tapers),  DSCC,  pre-emphasis,  short-time  feature  Gaussianization.  We  combined 
these  all  and  arrive  at  a  40-dimensional  acoustic  feature  whose  extraction  procedure  is  shown 
in  Figure  5.  According  to  our  old  naming  conventions,  this  feature  should  be  called 

Approved  for  Public  Release;  Distribution  Unlimited. 


7 


PRE_PEAK8_MFCC20_GAU  S  S3  00  +  PRE_PEAK8_DSCC20(GAUSS300)_CMN 
And  we  call  it  PPMD  for  short. 
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Figure  5.  Procedure  of  Extracting  PPMD  Features 


In  addition  to  this,  we  also  used  a  version  of  MFCC  features  that  was  used  previously  in 
other  projects.  This  feature  also  has  40  dimensions,  and  we  call  it  RM40.  RM40  shows 
slightly  inferior  performance  to  PPMD  on  the  ROSSI  database,  but  is  consistently  better  than 
PPMD  on  the  NIST  database.  This  verifies  the  fact  that  there  is  no  universal  best  feature. 

3.2  Speaker  Modeling  and  Scoring 

3.2.1  Gaussian  Mixture  Modeling  (GMM) 

The  basic  method  of  modeling  the  speech  of  a  target  speaker  is  Gaussian  mixture  models 
(GMM).  A  GMM  is  made  up  of  C  components  (in  our  experiments,  C  =  256  for  ROSSI  and 

MIXER5,  and  C  =  1024  for  NIST),  each  component  i  having  a  mean  vector  //( ,  a  (typically 

diagonal)  covariance  matrix  E . ,  and  a  weight  vy . 

The  GMM  for  a  target  speaker  can  be  trained  entirely  from  the  data  of  this  speaker,  but  it 
is  usually  preferred  to  adapt  it  from  the  UBM.  There  are  two  reasons  to  favor  adaptation:  there 
may  be  not  enough  data  to  train  a  speaker  GMM  robustly,  while  UBM  can  provide  a 
reasonable  prior;  the  components  of  an  adapted  GMM  are  aligned  with  those  of  the  UBM, 
which  is  crucial  for  the  SVM  modeling  to  be  introduced.  In  adaptation,  usually  the  covariance 
matrices  and  weights  are  kept  fixed,  and  only  the  component  means  are  adapted. 

3.2. 1.1  GMM  Adaptation  Techniques 

Common  GMM  adaptation  techniques  include  relevance  MAP  adaptation  [11]  and  joint 
factor  analysis  (JFA). 

In  relevance  MAP  adaptation,  first  an  EM  iteration  is  run  with  the  UBM  as  the  initial 
value  to  obtain  a  set  of  //( ’s.  Then  these  //( ’s  are  interpolated  with  those  of  the  UBM  via  a 
relevance  factor  r: 
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Hi  <■ 


UtMi+rMt 


UBM 


n;  +r 


(5) 


where  nt  is  the  soft  count  of  feature  vectors  aligned  with  component  i.  This  interpolation  has 

the  effect  that  the  less  data  a  component  has,  the  more  its  mean  is  pulled  back  to  the  UBM 
mean.  The  appropriate  value  of  r  depends  on  the  amount  of  data  available,  but  the 
perfonnance  is  found  to  be  not  very  sensitive  to  r.  A  typical  value  of  r  is  16. 

Relevance  MAP  adaptation  is  called  “MAP”  because  it  is  equivalent  to  imposing  a  prior 
on  the  component  means: 

,U,)  (6) 

r 

Joint  factor  analysis  is  a  more  sophisticated  version  of  MAP  adaptation,  where  the  parameters 
of  the  prior  also  need  to  be  trained  from  data.  Due  to  its  complexity,  it  deserves  separate 
explanation  in  Section  3.3. 


3.2. 1.2  Scoring  Methods  with  GMM  Modeling 

The  basic  method  to  score  an  utterance  against  a  GMM  is  full  frame-by-frame  scoring. 
Denote  the  feature  vectors  of  the  utterance  by  X  =  {X1,...,Xn}  ,  and  the  GMM  by 

G  =  {/Lij ,  ,  w(. } ,  then  the  score  is  defined  as  the  frame-average  log-likelihood: 

score(X,  G)  =  -  J  log  £  wiP{Xt ;  jut ,  2, )  (7) 

n  ,=i  ,=i 

where  P(X;PX)  is  the  multivariate  Gaussian  density  function.  This  score  is  usually 

adjusted  by  subtracting  the  score  of  the  trial  utterance  against  the  UBM. 

Full  frame-by-frame  scoring  can  be  slow  due  to  the  summation  across  the  GMM 
components.  When  an  utterance  needs  to  be  scored  against  multiple  GMMs  (e.g.  in  speaker 
identification,  or  in  speaker  verification  where  an  utterance  needs  to  be  verified  against 
multiple  speakers),  it  can  be  beneficial  to  use  fast  frame-by-frame  scoring  ([11],).  In  fast 
frame-by-frame  scoring,  the  second  summation  doesn’t  go  across  all  components,  but  only  a 
few  (say,  5)  top  components: 

score(X,G)  =  -^log  £  wtp{Xt\iit,Y.t)  (8) 

n  ,=1  ieTC, 

The  “top  components”  set  TCt  is  chosen  for  each  feature  vector  based  on  the  UBM:  it 
contains  the  components  that  yield  the  several  highest  values  of  wiP{Xt\  Pli"IM ,  X  ) . 
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3.2.2  Support  Vector  Machine  (SVM)  Modeling 


GMM  is  a  generative  modeling  technique  of  speakers,  but  speakers  can  also  be  modeled 
discriminatively  with  support  vector  machines  (SVM).  For  each  target  speaker,  a  GMM  is 
adapted  from  the  UBM  for  each  training  utterance,  and  these  are  used  as  the  positive  training 
examples  for  the  SVM.  In  addition,  a  GMM  is  adapted  from  the  UBM  for  each  utterance  in  a 
pool  of  “SVM -negative”  utterances,  and  these  are  used  as  negative  training  examples  for  the 
SVMs  of  all  target  speakers.  It  is  possible  to  re-use  the  utterances  used  for  UBM  training  as 
SVM-negative  utterances. 

To  train  an  SVM  with  GMMs  as  training  examples,  it  is  necessary  to  come  up  with  a 
kernel  function  for  GMMs.  Because  the  component  covariances  and  weights  are  not  adapted, 
a  GMM  can  also  be  represented  by  the  collection  of  its  component  means.  The 
CF-dimensional  vector  obtained  by  stacking  the  F-dimensional  mean  vectors  of  the  C 
components  is  called  the  supervector.  A  kernel  function  for  GMMs  is  an  inner  product  of  the 
supervectors,  optionally  with  some  normalization  and  weighting  with  the  covariance  matrices 
and  component  weights. 

We  used  the  following  six  types  of  kernel  functions: 


=  Z,  kfj“ exp 


-^BHATT  (Ma,Mfe)  =  ^.exp 


8 


(9) 

(10) 

(11) 

(12) 

(13) 


^WBHATT  (M- a  ’  ^ b  )  ^  /  Wi  eXP 
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(14) 


These  functions  are  derived  from  different  motivations;  see  A  for  details.  Despite  their 
different  motivations,  the  forms  of  the  kernels  look  similar,  and  it  is  possible  to  tweak  the 
powers  and  constants  in  the  formulas  to  make  new  kernels.  Our  experience  (see  Section  4.2 
for  example),  however,  consistently  shows  that  there’s  not  much  difference  between  the 
performances  of  five  of  these  kernels,  except  for  L2  which  performs  significantly  worse. 

To  score  a  trial  utterance  against  the  SVM  of  target  speaker,  first  adapt  a  GMM  for  this 
utterance,  then  substitute  this  GMM  into  the  discriminant  function  of  the  SVM.  For  a 
negative  trial  utterance,  the  score  is  usually  around  -1.0.  However,  because  the  SVMs  are 
usually  trained  with  very  few  positive  examples  (often  only  one),  the  score  can  be  negative 
even  for  positive  trial  utterances.  Our  experience  is  that  the  scores  for  positive  trial  utterances 
are  usually  around  -0.7.  As  a  result,  we  cannot  directly  perform  speaker  verification  based  on 
the  sign  of  this  score,  but  need  to  tune  the  threshold  on  development  data. 


Approved  for  Public  Release;  Distribution  Unlimited. 


10 


Our  experiments  on  both  the  ROSSI  database  (see  Section  4.2)  and  the  NIST  database 
(see  Section  4.3)  have  shown  that  SVM  modeling  is  significantly  better  than  GMM  modeling. 

3.3  Joint  Factor  Analysis  (JFA)  Modeling 

JFA  is  a  generalization  of  relevance  MAP  adaptation  in  GMM  modeling.  It  assumes  that 
the  supervector  m  of  a  speaker  GMM  can  be  decomposed  as 

m=M  +  Vy  +  Ux  +  Dz  (15) 

In  this  formula,  M  is  the  UBM  supervector;  V  and  U  are  tall  and  thin  matrices  with  CF  rows 
and  only  a  few  hundred  columns  whose  columns  are  eigenvoices  and  eigenchannels;  D  is  a 
CF-by-CF  diagonal  matrix;  y,  x,  z  are  random  vectors  that  have  sizes  compatible  with  the 
matrices,  and  obey  the  standard  normal  distribution.  In  other  words,  it  imposes  on  the  speaker 
GMM  supervector  m  a  prior  distribution  of  the  following  form: 

m~  N  (M ,WT +UUT  +  D2)  (16) 

It  can  be  shown  that  relevance  MAP  adaptation  is  a  special  case  of  JFA  with  V  =  U  =  0  and 
D2  =  £  /  r  (2  is  a  diagonal  block  matrix  whose  blocks  are  the  component  covariance 
matrices). 

In  Eq.  (15),  the  terms  Vy  and  Dz  are  considered  to  come  from  the  speaker  characteristics, 
while  the  tenn  Ux  is  considered  to  arise  due  to  environment  and  channel  variations.  Therefore 

the  latter  term  is  dropped,  and  the  supervector  is  take  to  be  the  sum  M  +  Vy  +  Dz  . 

To  obtain  GMM  models  (or  supervectors)  for  a  set  of  speakers  via  JFA,  there  are  two 
major  steps.  First,  the  matrices  V,  U,  D  specifying  the  prior  distribution  must  be  trained  from 
data.  Second,  the  vectors  y,  x,  z  need  to  be  estimated  under  the  MAP  criterion.  Both  these 
steps  involve  complicated  mathematics;  we  give  a  tutorial  in  Section  3.3.1. 

As  an  extension  of  relevance  MAP  adaptation,  JFA  also  falls  in  the  category  of  GMM 
speaker  modeling,  and  therefore  can  be  used  in  combination  with  SVM  speaker  modeling.  It 
was  our  goal  to  show  that  JFA  +  SVM  modeling  would  be  better  than  relevance  MAP  +  SVM, 
but  unfortunately  we  weren’t  able  to  achieve  this  goal. 

JFA  was  first  proposed  in  a  fully  convolved  form  [12],  and  was  later  simplified  by 
multiple  researchers,  resulting  in  many  variants.  These  variants  and  our  choice  will  be 
introduced  in  Section  3.3.2.  In  order  to  deal  with  the  channel  mismatch  between  the  target 
GMM  (where  the  channel  effect  Ux  has  been  dropped)  and  the  trial  utterance,  many  scoring 
methods  have  been  proposed  for  JFA  [13],  with  different  levels  of  approximation.  These 
scoring  methods  will  be  introduced  in  Section  3.3.3.  Our  experiment  setup  and  results  of  JFA 
will  be  given  in  Section  4.3. 
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3.3.1  A  Tutorial  of  JFA 


3.3. 1.1  ML  training  of  GMMs,  and  Baum- Welch  statistics 


To  fully  understand  the  mathematics  involved  in  JFA,  it  is  helpful  to  review  the 
maximum  likelihood  (ML)  training  of  GMMs. 

Suppose  we  want  to  train  a  GMM  with  C  components  from  a  set  of  F-dimensional  - 

feature  vectors  X  =  {Xx ,...,XT)  (all  column  vectors).  For  the  c-th  component  ( c  =  1, . . . , C ), 


we  denote  the  mean  vector  by  m.  (column  vector),  the  covariance  matrix  by  Sc ,  and  the 


component  weight  by  wc  .  We  assume  the  covariance  matrices  to  be  diagonal.  For 


conciseness,  we  define  the  CF-dimensional  “supervector”  m  as  the  vertical  concatenation  of 
the  mean  vectors,  and  the  CF  x  CF  “supercovariance”  matrix  £  as  the  diagonal 
concatenation  of  the  covariance  matrices. 

In  our  problem,  we  assume  the  covariance  matrices  and  the  components  to  be  fixed  and 
known,  and  we  need  to  find  the  supervector  m  that  best  fits  the  data  X.  Under  the  maximum 
likelihood  (ML)  criterion,  we  seek  to  maximize  the  log  likelihood  of  the  data  X  given  the 
supervector  m: 


L(m)  =  log  P(X  |  m)  =  £  log  P(Xt  \  m )  =  X  l°g  H  P(Zt  )p(Xt  I  z,  > m) 


=2>«I 


w„ 


FI 2  I  ^  1 1/2 


(2k)  |  £ 


exp 


-±(Xt-mc)TX-c\Xt-mc) 


(17) 


where  Z(  =  1,...,C  refers  to  the  index  of  the  component  that  generated  the  data  sample  Xt . 

This  objective  function  is  hard  to  optimize  directly  because  of  the  sum  nested  in  the  log,  and 
the  usual  solution  is  the  EM  algorithm.  The  EM  algorithm  requires  an  auxiliary  function 

Q(m,  m ')  that  satisfies 


L(m)  =  Q(m,  m),  Q(m,  m  ’)  <  L(m  ’)  (18) 

In  each  EM  iteration,  the  auxiliary  function  Q(m,m')  is  maximized  w.r.t.  m\  and  the  old 
supervector  m  is  replaced  by  the  new  supervector  m\  Eq.  (18)  guarantees  that  the  objective 
function  L(m)  is  non-decreasing. 

The  expression  of  the  auxiliary  function  is  given  by: 
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Q(m,m ')  =  L(m ')  -  X  KL(Zt  I  Xt > m  \\z,\Xn  m ') 

t 

=  ^\ogP(X,  NVSSPCZ.  |x„m)l°g  ^ 

t  t  Zt  r\Zjt  I  A/J  m  ) 


YLP(Z,  \X„m) 


t  zt 


log.P(X,  |  m’)-log 


P(Z,|X„m) 
P(Z,  | 


=£2>(z,|;r„m)iog  P('y,'z,l"l’> 

v  <’  6  P(Zt\Xt,m) 


t  zt 


(19) 


=  ZZP(Z,  I  Z„»i) log P(X„Z,  I  »HZ*(Z,  I 

t  Zt  t 

The  second  term  (entropy)  is  a  “constant”  that  does  not  depend  on  m\ 

In  the  E-step  of  an  EM  iteration,  we  calculate  the  posterior  probability  of  the  component 


index  given  the  data  P(Zt  \  Xt ,  m) ,  according  to  the  old  supervector  in.  This  step  is  also  called 


aligning  the  data  with  the  components.  We  denote  P(Zt  =c\Xt,m )  by  Ntc  for  short,  and 

define  the  following  statistics  (called  Baum-Welch  statistics  [14])  of  the  data  X  for  each 
component,  which  will  be  useful  in  the  M-step: 

•  Zeroth  order  statistics  Nc  ='^JNlc:  the  “number”  of  data  samples  aligned  with  the 

t 

c-th  component. 

•  First  order  statistics  Fc  =  X  xtcXt :  the  sum  of  the  data  samples  aligned  with  the  c-th 

t 

component. 

•  Second  order  statistics  Sc  =  X  ^ tcX tX]  :  the  sum  of  the  “squares”  of  the  data 

t 

samples  aligned  with  the  c-th  component.  This  will  not  be  useful  since  we’re  only 
training  the  mean  vectors  of  the  GMM. 

We  also  define  a  CF  x  CF  diagonal  matrix  N  whose  diagonal  blocks  are  NCIF ,  where 


c  =  1, . . . ,  C  and  IF  is  a  unit  matrix  of  order  F,  and  a  CF-dimensional  column  vector  F  as  a 


vertical  concatenation  of  the  Fc  for  all  the  components.  It’s  worth  pointing  out  that,  although 

not  explicitly  noted,  the  Baum-Welch  statistics  (including  N  and  F)  are  functions  of  the  data  X 
and  the  supervector  m. 

Now  we  look  at  the  M-step,  in  which  we  maximize  Q(m,m')  w.r.t.  in ’ .  We  simplify 


Q(m,m')  as  follows: 
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Q(m,m ')  =  constant  +  XX^2.  |  Xt ,  m)  log  P(Xt ,  Z,  |  m ') 


=  constant  +  XX^. 


log 


w„ 


(2 n)Fn  |  Xc  |1/2 


-Ux-m'ttiX-m'Z 


=  constant  +  £  Z Ntc  (m '/  ~ 1 m '/  'c ) 


(20) 


=  constant  +  nz ’  X  F — in'  X  /Vm ' 

2 


Setting  y,n.Q(m, m ')  =  X  F -X  1  A/m  ’  =  0 ,  we  obtain  the  optimal  new  supervector: 


m  =  N~xF  (21) 

which  is  the  familiar  conclusion  of  setting  the  new  mean  vector  of  each  component  as  the 
average  of  the  data  aligned  with  that  component. 

The  entire  EM  algorithm  starts  by  initializing  the  supervector  in  randomly  or  using 
Fmeans  clustering,  and  then  repeats  the  EM  iteration  for  a  given  number  of  times  or  until 
convergence.  The  Baum- Welch  statistics  are  re-calculated  in  every  iteration. 


3.3. 1.2  MAP  adaptation  of  GMMs  in  general 

Sometimes  we  don’t  have  enough  data  to  train  a  GMM  with  ML  robustly.  In  such  cases 
we  can  train  the  GMM  with  MAP,  i.e.  imposing  a  prior  distribution  on  the  supervector  in.  The 

most  common  prior  is  the  Gaussian  N  (M,Z)  ,  where  M  is  typically  the  supervector  of  the 

UBM.  The  MAP  objective  function  is: 

L  \m )  =  L(m )  -  ^  ( m  -  M)r  Z  1  (m  -  M )  (22) 

Similarly,  the  auxiliary  function  will  also  include  the  prior  tenn: 

Q \m, m ')  =  constant  +  m ,z  X  lF  - m  'T  X  ' Nm (m M )r Z  1  (in M )  (23) 

Setting  Vm,Q \m,m')  =  X_1F-X_1Am'-Z_1(m'-M)  =  0  ,  we  obtain  the  optimal  new 
supervector: 

in  =  (/  +  ZX  1  A)_1  (M  +  ZX  [F)  (24) 

Because  calculating  the  Baum- Welch  statistics  can  be  time-consuming,  in  MAP  adaptation 
we  usually  perfonn  only  one  iteration,  initializing  the  supervector  to  be  that  of  the  UBM. 
Therefore  the  Baum-Welch  statistics  are  also  calculated  against  the  UBM. 

Now  is  a  good  chance  to  understand  why  the  relevance  MAP  adaptation  is  called  MAP. 
Let  the  covariance  matrix  of  the  prior  distribution  be 

Z=-X  (25) 

r 

where  r  is  the  relevance  factor.  It  is  easy  to  see  that  the  adapted  supervector  will  be 
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m  =  (rI  +  Ny\rM  +  F ) 


(26) 


which  is  exactly  the  rule  of  relevance  adaptation. 

3.3. 1.3  GMM  adaptation  using  JFA 

JFA  is  just  a  way  of  specifying  the  prior  distribution  of  the  supervector  in.  It  fonnulates  m 
as 

m  =  M  +  Ux  +  Vy  +  Dz  (27) 

In  this  formula,  M  is  the  UBM  supervector.  U  and  V  are  tall-and-thin  matrices  with  CF  rows 
(typically  between  104  and  105)  but  only  several  hundred  columns,  and  D  is  a  CF  x  CF 
diagonal  matrix,  x,  y,  z  are  mutually  independent  random  column  vectors  compatible  with  the 

size  of  U,  V,  D,  and  obeying  the  standard  Gaussian  distribution  N  (0,1)  .  The  tenn  Ux  models 

the  variability  in  the  supervector  due  to  the  channel,  and  Vy  +  Dz  models  the  variability  due  to 
the  speaker.  The  columns  of  U  and  V  act  as  eigenvectors,  and  are  called  eigenchannels  and 
eigenvoices  respectively.  The  elements  of  x  and  y  are  called  channel  factors  and  speaker 
factor1. 

Eq.  (27)  effectively  specifies  the  following  prior  distribution: 

m  ~  N  (M,UUt  +  VVt  +D2)  (28) 

It  seems  we  can  substitute  Z  =  UUT  +  WT  +  D 2  into  Eq.  (24),  and  the  problem  is  solved. 
But  there  are  two  issues: 

(1)  Calculating  Eq.  (24)  involves  inverting  a  CF  x  CF  non-diagonal  matrix,  which  is 
computationally  intractable; 

(2)  The  channel  variability  modeled  by  the  Ux  term  is  actually  a  nuisance  that  we 
wouldn’t  like  to  include  in  the  updated  supervector,  i.e.  we  want  m  =  M  +  Vy  +  Dz . 
This  requires  us  to  explicitly  solve  for  the  optimal  adapted  values  of  x,  v,  and  z. 

In  order  to  do  this,  we  can  write  the  MAP  auxiliary  function  in  tenns  of x,  v,  z  and x\y\ 
z’: 

Q  \x,  y,  z,  x ',  y ',  z ')  =  constant  +  (M  +  Ux '+  Vy '+  Dz  ')rI.^F 

~{M  +  Ux'+  Vy  ’+  Dz  Y  2T1 N(M  +  Ux  ’+  Vy  ’+  Dz  ’)  (29) 

-~x'T  x'--y'T  v'--z’rz’ 

2  2  2 

In  a  more  general  version  of  the  EM  algorithm,  we  don’t  need  to  maximize  the  auxiliary 
function  in  the  M-step;  we  only  need  to  increase  it.  Therefore  we  take  a  two-step  approach 
following  the  code  in  [15]:  in  the  first  step  we  find  x’  and  v’  to  maximize  Q ’  assuming  z’  =  0, 


1  The  terminology  is  not  unified.  In  some  literature,  the  columns  of  U  and  V are  called  channel  factors  and  speaker  factors, 
while  the  elements  of  x  and  y  are  called  factor  loadings. 


Approved  for  Public  Release;  Distribution  Unlimited. 


15 


and  in  the  second  step  we  find  z’  to  maximize  Q ’  assuming  the  values  x’  and  y  ’  found  in  the 

last  step.  Because  Ux  +  Vy  =  [U  V][x;y]  (where  the  semicolon  stands  for  vertical 

concatenation),  the  method  for  solving  for  x’  and  y’  jointly  is  identical  to  that  for  solving  v’ 
alone  assuming  x’  =  0.  We  absorb  everything  in  Eq.  (29)  that  doesn’t  depend  on  y’  into  the 
constant: 

Q ’( y,  y ')  =  constant  +  y }T  VTZ ~x (F  -  NM)  ~ y'T(I  +  VT^lNV)y  ’  (30) 

Setting  VyQ'(y,yr)  =  VT’Z~l(F  -NM)-(I  +  VtT ~1NV)y'  =  0  ,  we  obtain  the  optimal  new 
speaker  factors: 

y  =  (I  +  VTYrlNV)-lVTY.-\F-NM)  (31) 

If  we  solved  forx  and  y  jointly,  then  the  optimal  new  factors  would  be: 

[x;y]  =  (I  +  [U  VfYTxN[U  V]y\U  Vf'Z1(F-NM)  (32) 

The  matrix  inversion  here  is  no  challenge  because  the  order  of  the  matrix  is  only  several 
hundred. 

By  letting  x'  =  x,  y'  =  y  and  setting  the  gradient  of  Eq.  (29)  w.r.t.  z’  to  0,  we  can  obtain 
the  optimal  z’: 

z  =  (I  +  D2YrlNylDL-\F  -  N(M  +  Ux  +  Vy)\  (33) 

The  matrix  inversion  here  is  also  easy  because  the  matrix  to  invert  is  diagonal. 

As  said  in  the  previous  subsection,  in  MAP  adaptation  we  only  do  one  EM  iteration, 
initializing  the  supervector  to  that  of  the  UBM.  Therefore  the  Baum-Welch  statistics  N  and  F 
in  Eqs.  (32)  and  (33)  are  calculated  against  the  UBM.  Since  we  do  not  want  to  include  the 
channel  variability  in  the  adapted  GMM,  the  adapted  supervector  will  be 

m  =  M  +  Vy  +  Dz  (34) 


3.3. 1.4  Training  the  matrices  U,  V,  D 

The  procedure  described  in  the  last  subsection  happens  in  the  “speaker  enrollment”  step, 
where  the  matrices  U,  V,  D  are  known,  and  the  training  data  X  are  the  feature  vectors 
extracted  from  the  utterance(s)  of  each  speaker.  However,  the  matrices  U,  V,  D  need  to  be 
trained  before  enrolling  the  speakers,  and  this  problem  is  dealt  with  in  this  subsection. 

We  train  these  matrices  one  by  one  in  the  order  of  V,  U,  D,  following  the  code  in  [15]. 
When  training  one  matrix,  we  assume  that  the  matrices  not  yet  trained  are  equal  to  zero.  We 
don’t  train  the  matrices  jointly  because  it  would  be  mathematically  much  more  difficult,  and 
also  because  it  is  pointed  out  in  [16]  that  when  they’re  trained  jointly,  D  often  gets 
undertrained  -  most  of  the  speaker  variability  is  accounted  for  by  Vy,  and  Dz  has  little  effect. 
This  is  probably  because  Uhas  many  more  degrees  of  freedom  than  D. 

Before  describing  the  training  algorithm,  which  is  explained  in  detail  in  [17],  we  shall 
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point  out  an  implicit  approximation  made  by  that  paper  on  the  data  likelihood  P(X  \  in) . 
From  Eq.  ( 1 8),  we  know  that 

log  P(X  |  m)  =  L(m )  >  Q(M,  in)  (35) 


where  Mis  the  UBM  supervector,  and  that  the  difference  between  L(m)  and  Q(M,m )  is  the 

Kullback-Leibler  divergence  between  the  “alignment”  of  the  data  against  the  UBM  and  the 
GMM  with  supervector  in.  We  assume  the  divergence  is  small  enough  to  neglect,  thereby 
approximating  the  data  likelihood  by 

log  P(X  |  m)  «  Q(M,m )  (36) 


Then,  from  Eq.  (20),  we  have 

log P(X  |  m)  «  constant  +  mT'L~1F  —  ^mT'Z~1Nm  (37) 

where  the  Baum- Welch  statistics  N  and  F  are  calculated  against  the  UBM.  Applying  this 
approximation  has  the  advantage  that  the  Baum- Welch  statistics  can  be  pre-calculated  against 
the  UBM  and  stored,  and  needn’t  be  re-calculated  the  in  the  iterative  procedure  of  training  U, 
V,D. 


(1)  Training  the  V  matrix 

Training  the  V  matrix  requires  utterances  from  multiple  speakers,  and  ideally  multiple 


utterances  from  each  speaker  to  average  out  the  channel  variability.  Denote  by  A (s )  the 


feature  vectors  of  the  speaker  s.  The  training  objective  of  the  V  matrix  is  to  maximize  the  total 
data  likelihood: 


Z(U)  =  XlogjP(A(s)|U)  (38) 

5 


The  conditioned  probability  may  look  weird,  but  we  can  plug  in  the  hidden  variable  y  (5) ,  just 


like  Z,  in  Eq.  (17): 

Z(U)  =  ]Tlog  f  P(y(s))-P(X(s)|y(s),U)  d y(s)  (39) 

y(s) 

Again,  we’re  having  an  integral  nested  within  a  log,  which  we  tackle  with  the  EM  algorithm. 
The  auxiliary  function  is: 

Q(V,  V ')  =  constant  +  X  J  p(y(s)  I  x(s)> v)  lo§  p(x TOO  I  v ')  dy(j) 

*  ^  (40) 

=  constant  +  X  J  p(y(s)  I  x(s)> v)  lo§  p(x (s)  I  V ')  dy(v) 

*  r(s) 

For  clarity,  we  will  drop  the  argument  (5)  hereafter.  The  J  P(y  \  X,V)[-]  d y  part  is  actually 

y 

an  expectation  operator  over  the  posterior  distribution  of  y  given  the  data  X,  which  we  will 
denote  as  E  for  short.  With  the  approximation  of  Eq.  (36),  this  posterior  distribution  is  given 
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by  Eq.  (30),  i.e. 


y  ~  N  {{I  +  VT^lNVyxVT^\F -NM),  (/  +  VT'L~1NV)~1)  (41) 

where  the  Baum-Welch  statistics  are  calculated  against  the  UBM.  From  this  we  can  calculate 
some  statistics  about  y: 

Ey  =  (/  +  V^NVy1  VT^1  (. F  -  NM)  (42) 

E(  yyT  )  =  (Ey)(Ey)r  +  (I  +  V^NVy1  (43) 

which  will  be  useful  in  simplifying  Q(V,  V ') . 

Now  let’s  go  on  simplifying  Eq.  (40).  The  part  log  P(X  \  y,V)  is  approximated  by  Eq. 
(37): 

Q(V,  V')  =  constant  +  £  |^((M  +  V ’  y)T  TlF)  -  ^E((M  +  V  ’  y)T  2T1  iV(M  +  V  ’  y))| 

=  constant  +  £  j E[yrV ,T  E_1(F-  NM)]  ~^E(yTV ,T  YEXNV ’ y)| 

=  constant  +  ^|tr[F ,T  YT\F - NM)(Ey)T ]-Mx[V }T  NV ' E(yyT )\ | 

(44) 

Set  the  gradient  w.r.t.  V  to  zero: 

VrQ(V,V')  =  £[Z-‘(F -NM)(Ey)r -X~lNV'  E(yyT)~]  =  0  (45) 

s 

we  can  find  the  optimal  eigenvoice  matrix  to  be  the  solution  of  the  following  equation  system: 

X  NVE(yy  )  =  £  (F  -  NM)(Ey)T  (46) 

s  s 

This  equation  system  can  be  solved  row  by  row. 

The  entire  EM  algorithm  for  training  V  starts  by  initializing  V  randomly,  and  going 
through  the  above  iteration  for  a  number  of  times.  The  procedure  is  summarized  below, 
introducing  some  new  symbols  for  conciseness: 

•  For  each  speaker  s : 

■  Calculate  the  Baum-Welch  statistics  N(s)  and  F(s)  against  the  UBM; 

■  Calculate  the  “centered”  first-order  Baum-Welch  statistics  F(s)  =  F(s)  -N(s)M  ; 

•  Initialize  V  randomly. 

•  Repeat  for  a  fixed  number  of  times  or  until  convergence: 

■  For  each  speaker  s: 

♦  Calculate  the  matrix  G(s)  =  (/  +  VtTE1  N(s)Vyx ; 

♦  Calculate  the  statistics  about  the  posterior  distribution  of  y: 

*  E,  (s)  =  Ey  =  G{s)VT'L  x F {s) ;  (Eq.  (42)) 

*  E2(s)  =  E(yyr)  =  E^E^sf  +  G(s) ;  (Eq.  (43)) 
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■  Solve  the  equation  system  ^N(s)VE2 (s)  =  '^jF(s)El(s)r  ;  (Eq.  (46)) 

s  s 

■  Replace  V  with  V  . 

(2)  Training  the  V  matrix 

The  procedure  for  training  V  is  similar  to  that  of  training  U.  The  differences  include: 

•  In  the  main  loop,  things  are  calculated  for  each  utterance  instead  of  for  each  speaker; 

•  The  speaker  factors  y(s)  is  estimated  for  each  speaker  before  the  main  loop,  and 
subtracted  from  the  centered  first-order  Baum- Welch  statistics  of  each  utterance. 

The  entire  procedure  is  as  below. 

•  For  each  speaker  s: 

■  Calculate  the  Baum-Welch  statistics  N(s)  and  F(s)  against  the  UBM; 

■  Estimate  the  speaker  factors  y(s)  =  (I +  VTYFlN{s)VylVTYFl(F(s)-N{s)M) ;  (Eq. 

(31)) 

•  For  each  utterance  u  (suppose  it’s  spoken  by  s): 

■  Calculate  the  Baum-Welch  statistics  N(u)  and  F(u)  against  the  UBM; 

■  Calculate  the  “centered”  first-order  Baum-Welch  statistics 
F(u)  =  F(u)  -  N(u)[M  +  Vy(s)] ; 

•  Initialize  U  randomly. 

•  Repeat  for  a  fixed  number  of  times  or  until  convergence: 

■  For  each  speaker  u: 

♦  Calculate  the  matrix  G(u)  =  (I +  Ur’Z~1N(u)U)~l; 

♦  Calculate  the  statistics  about  the  posterior  distribution  of  x: 

❖  E1  (u)  =  Ex  =  G{u)UTY.lF{u) ; 

♦♦♦  E2(u )  =  E(xxt)  =  Ex(u)Ex(u)t  +  G(u) ; 

■  Solve  the  equation  system  ^ N(u)UE2 (u)  =  ^  F{u)El  ( u)T  ; 

u  u 

■  Replace  U  with  U  . 

(3)  Training  the  D  matrix 

•  For  each  speaker  s: 

■  Calculate  the  Baum-Welch  statistics  N(s)  and  F(s )  against  the  UBM; 

■  Estimate  the  speaker  factors  y(^)  =  (/  +  UrE_1Ar(5)U)_1UrE_1(F(5)-  N(s)M)  ; 

•  For  each  utterance  u  (suppose  it’s  spoken  by  s): 

■  Calculate  the  Baum-Welch  statistics  N(u)  and  F(u)  against  the  UBM; 

■  Estimate  the  channel  factors 

x(u)  =  (/  +  UTTFlN(u)Uyl  UrE^  [F(u)  -  N(u)(M  +  Vy(s ))] ; 

■  Calculate  the  “centered”  first-order  Baum-Welch  statistics 
F(u)  =  F(u)  -N(u)[M  +  Vy(s )  +  Ux(u)] ; 

•  For  each  speaker  s: 

■  Calculate  the  “centered”  first-order  Baum-Welch  statistics  F(s )  by  summing  up 
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the  F(u )  of  all  utterances  it  spoken  by  5. 

•  Initialize  D  to  a  random  diagonal  matrix. 

•  Repeat  for  a  fixed  number  of  times  or  until  convergence: 

■  For  each  speaker  s : 

♦  Calculate  the  matrix  G(s)  =  (I  +  D2!.  ' N(s)fl ; 

♦  Calculate  the  statistics  about  the  posterior  distribution  of  z: 

❖  El  ( s )  =  Ez  =  G(s)DY.  lF{s) ; 

❖  E2(s)  =  E(zz'  )  =  C,  (s)E]  (.s')7  +  G(s) ;  * 

■  Solve  the  equation  system  '^N(s)DE2 (5)  =  ^jF(s)El(s)r  ;  * 

s  s 

■  Replace  D  with  D  . 

Note  that,  because  D  is  a  diagonal  matrix,  for  the  equations  marked  with  an  asterisk,  we 
only  need  to  care  about  the  diagonal  elements. 

3.3.2  Variants  of  JFA  and  Our  Choices 

(1)  Order  of  training  the  V,  U,  D  matrices.  [14]  doesn’t  state  it  very  clearly  but  seems 
to  imply  that  they  trained  the  matrices  in  the  order  of  V-D-U.  However,  in  most  code  we  have 
found  (including  the  Matlab  demo  [15]  and  Alize),  the  matrices  were  trained  in  the  order  of 
V-U-D.  Our  tutorial  above  was  also  written  in  the  order  of  VU-D.  The  Janus  code  we  had  (see 
Section  4.3.1)  seemed  to  support  both  orders. 

The  two  different  training  orders  affect  the  data  for  training  D.  In  the  V-D-U  order,  D  is 
trained  before  the  eigenchannels,  therefore  there  must  be  multiple  utterances  for  the  same 
speaker  to  average  out  the  channel  effect.  On  the  other  hand,  in  the  V-U-D  order,  the  channel 
effect  in  the  utterances  for  training  D  can  be  removed  since  U  has  been  trained  already, 
therefore  it  is  possible  to  use  only  one  utterance  for  each  speaker.  However,  theoretically  the 
V-D-U  order  makes  more  sense,  since  it  first  trains  everything  related  to  speaker  variability, 
and  then  trains  the  rest  related  to  channel  variability. 

We  adopted  the  V-D-U  order  in  our  latest  experiments. 

(2)  Initialization  of  the  V,  U,  D  matrices.  This  is  not  carefully  stated  in  most  literature; 
usually  they  just  say  “random  initialization”  (e.g.  [14]).  However,  we  find  that  the  order  of 
magnitude  of  the  initial  value  is  important.  We  find  that  after  the  first  iteration,  the  directions 
of  the  columns  would  have  almost  converged,  and  in  the  subsequent  iterations  it  is  mostly  the 
magnitude  that  gets  adjusted.  When  the  initial  values  are  two  small,  it  would  take  more 
iterations  for  the  magnitude  to  converge;  when  the  initial  values  are  too  large,  the  magnitude 
would  decrease  at  a  very  slow  speed  and  wouldn’t  converge  even  in  hundreds  of  iterations. 
We  came  up  with  the  following  empirical  scheme  for  initializing  these  matrices: 

Let  L  =  CF  be  the  length  of  the  supervector,  X  be  the  mean  of  the  inverse  of  the 

elements  in  X  (the  block  diagonal  matrix  of  the  component  covariance  matrices),  and  N  be 

the  average  zeroth-order  Baum-Welch  statistics  (number  of  points)  per  component  and  per 
speaker.  Then,  initialize  the  elements  of  V  and  U  to  obey  the  nonnal  distribution 
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N  (0,  1  ILL  lN),  and  the  diagonal  elements  of  D  to  be  the  absolute  value  of  random 

variables  that  obey  the  nonnal  distribution  N  (0,  1  /  2T1  N ) . 

This  initialization  scheme  has  been  found  to  work  better  than  initializing  everything  to  be 
uniformly  distributed  within  the  interval  [0,1]  (as  implemented  in  Janus  and  Alize). 

(3)  Methods  and  number  of  iterations  in  training  the  V,  U,  D  matrices.  The  training 
procedure  of  the  matrices  is  an  iterative  procedure,  where  in  each  iteration  an  objective 
function  is  increased.  For  example,  in  the  training  of  the  V  matrix,  the  objective  function  is  as 
follows: 

L(V)  =  ^ogP(X(s)\V) 

(47) 

=  X log  J  P(v(s))  ■  P(X(s)  |  y(s),  V )  dy(s) 

rO) 

Regarding  y(s)  as  a  latent  variable,  we  built  an  auxiliary  function 

Q(V,  V ')  =  constant  +  ^  [  P(y(s)  \  X(s),  V )  log  P(X (s),  y(s )  |  V ')  dy(s)  (48) 

*  yte) 

which  is  maximized  in  each  iteration  with  a  series  of  updating  equations.  This  procedure  is 
called  “maximum  likelihood  (ML)  estimation”,  because  it  increases  the  likelihood  function 
L(V). 

Beside  the  most  common  ML  estimation,  in  [14],  another  updating  procedure  called 
“minimum  divergence  (MD)  estimation”  is  used.  Despite  the  name,  this  procedure  also 
increases  the  likelihood  function  L(V)  in  each  iteration.  It  is  also  an  EM  algorithm  but  chooses 
a  different  latent  variable  than  y(s),  and  its  name  comes  from  the  auxiliary  function  which  has 
the  form  of  the  KL-divergence  between  two  Gaussian  distributions. 

The  updating  equation  of  MD  estimation  is: 

V  <-  VJ,  where  JJT  =  -  £  E2  (5)  (49) 

S  s 

In  the  above  equation,  S  is  the  total  number  of  speakers,  and  E2(s)  =  LfvCsjyCs')7]  can  be 

calculated  according  to  Eq.  (43).  J  is  a  lower-triangle  matrix  called  the  “Cholesky 
decomposition”  of  the  right  hand  side.  (Some  formulas  in  [14]  include  another  term;  it  arises 
because  [14]  also  trains  the  M  in  the  JFA  model,  but  we  keep  M  fixed  to  the  UBM  mean.)  A 
complete  derivation  of  this  updating  equation  can  be  found  in  B. 

Since  MD  estimation  is  only  a  right-multiplication  onto  the  V  matrix,  it  does  not  change 
the  subspace  spanned  by  the  columns  of  V  (i.e.  the  channel  space),  but  only  performs  a  linear 
combination  upon  the  eigenvoice  vectors.  [14]  describes  the  advantage  of  MD  estimation  as 
“getting  good  estimates  of  the  eigenvalues  corresponding  to  the  eigenvoices”. 

Most  of  the  code  we  have  seen  uses  only  several  ML  iterations,  but  [14]  uses  7  ML 
iterations  plus  1  MD  iteration.  In  our  correspondence  with  P.  Kenny  (first  author  of  [14]), 
Kenny  claims  it  is  necessary  to  do  the  MD  iteration  in  order  for  the  magnitudes  of  the  matrix 
elements  to  be  interpretable.  In  reality,  we  found  that  the  MD  iteration  alters  the  magnitudes 
of  the  eigenvoices  /  eigenchannels  significantly.  After  the  ML  iterations,  the  lengths  of  the 
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eigenvoices  /  eigenchannels  are  almost  equal;  but  after  the  MD  iteration,  these  lengths  form 
the  shape  an  exponential  function.  However,  MD  iterations  did  not  contribute  significantly  to 
the  objective  function  nor  to  the  final  perfonnance;  also,  we  found  that  some  rows  of  V and  U 
tend  to  diverge  in  MD  iterations. 

In  our  latest  experiments  we  used  only  3  ML  iterations.  We  didn’t  use  MD  iterations  so 
that  we  could  compare  our  own  implementation  with  Janus  and  Alize;  we  did  only  3  iterations 
because  we  had  a  better  initialization. 

(4)  Estimating  x,  y,  z  in  speaker  enrollment.  In  Section  3.3. 1.3,  we  showed  how  to 
estimate  these  vectors  in  two  steps  (Eqs.  (32)  and  (33)).  In  fact,  it  is  also  possible  to  estimate 
them  jointly.  The  solution,  which  involves  more  complicated  math,  is  given  in  C.  The 
two-step  estimation  is  an  approximation  of  the  exact  joint  estimation,  but  the  deviation  is 
small  enough. 

The  references  [16]  and  [18]  introduce  a  “Gauss-Seidel-like  iterative  algorithm”,  which 
also  yields  an  approximate  solution.  It  has  the  advantage  of  being  easy  to  implement. 
However,  since  we  have  already  implemented  joint  estimation,  we  stuck  to  it  in  our 
experiments. 

3.3.3  Scoring  Methods  Special  to  JFA 

In  [13],  a  series  of  scoring  methods  special  to  JFA  are  compared.  With  the  exception  of 
full  frame-by-frame  scoring,  all  scoring  methods  are  approximate  and  are  based  on  the 
Baum-Welch  statistics.  The  score  of  a  trial  utterance  against  a  target  speaker  is  expressed  as  a 
function  of  the  Baum-Welch  statistics  N,  F  of  the  trial  utterance  (calculated  against  the  UBM), 
the  supervector  m  of  the  target  speaker  model,  and  the  concatenated  covariance  matrix  £ . 
Usually  this  score  is  adjusted  by  subtracting  the  score  of  the  utterance  against  the  UBM.  The 
adjusted  score  is  called  likelihood  ratio  (LLR),  and  will  also  contain  the  UBM  supervector  M. 

(1)  “nox”  -  no  consideration  of  channel  factors 

A  preliminary  approximation  is  to  replace  the  exact  frame-average  log-likelihood  in  Eq. 
(7)  with  the  EM  auxiliary  function,  which  simplifies  to: 

1  T  1 

scorenox(X,m)  =  —  [constant  +  m  'L  ( F — Nm )]  (50) 

n  2 

where  X  stands  for  the  trial  utterance,  and  n  is  the  length  (number  of  feature  vectors)  of  X.  The 
“constant”  term  does  not  depend  on  m.  The  LLR  will  be 

LLRnox  (X,  m)  =  —\mT'L~l  (F-- Nm)  -MTYFX  (F  -  -  NM)]  (5 1) 

"  n  2  2 

(2)  “intx”  -  integrating  out  the  channel  factors 

Recall  that  when  training  the  speaker  models,  the  channel  factors  x  are  discarded.  This 
means  the  target  speaker  supervectors  m  (as  well  as  the  UBM  supervector  M)  doesn’t  contain 
any  channel  factors.  Scoring  a  trial  utterance  against  m  effectively  ignores  the  channel  factors 
x  that  were  at  work  in  producing  X,  giving  rise  to  mismatch. 

Because  nothing  is  known  about  the  channel  factors  x  offf,  a  wise  way  to  deal  with  them 
is  to  integrate  x  out,  using  its  standard  normal  prior  distribution: 

scoreintx  (X,  m)  =  —  [  [constant  +  {m  +  Ux)T'L~x(F -—N(m  +  Ux))\  p{x\  0,1)  dx  (52) 
n3x  2 
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This  works  out  to 


score intx (X, m)  =  —  [constant  +  {F-Nm)T'L  'UGU'l.  1  (F  -  Nm)  +  m'  Y.  1(F-—Nm)\ 
n  2 

(53) 

where  G  =  (/  +  UTYFlNUy 1 .  The  LLR  is  just 

LLRintx  ix’ m )  =  score  intx  (X,  m)  -  scoreintx  (X,  M )  (54) 

(3)  “pointx”  -  point  estimate  the  channel  factors 

Instead  of  integrating  out  the  channel  factors  x,  we  can  also  find  a  point  estimate  of  x  that 
maximizes  the  posterior  probability  of  X,  i.e.  maximizes  the  integrand  of  Eq.  (52).  The 


solution  is  given  by 

xm=GUTTF\F-Nm )  (55) 

Therefore  the  score  function  is 

score  ^tx (X, m)  =  -[constant  +  (m  +  Uxm  )X  (F-|  N(m  +  Uxm )]  (56) 

n  2 

To  calculate  score poin[s  (2f,  M ) ,  another  point  estimate  of  x  is  needed: 

xM  =  GUtT  1  (F  -  NM )  (57) 

score  intx (X,M)  =  —  [constant  +  (M  +  UxTu )YFX (F-| N(M  +  UxM )]  (5 8) 

n  2 

The  LLR  is  just 

LLRpoi„tx  (3f,  m)  =  score  pointx  (X,  m)  -  scorepomtx  (X,  M )  (59) 

(4)  “ubmx”  -  UBM  point  estimate  the  channel  factors 

An  approximation  is  made  by  replacing  xm  with  xM  . 

1  1 

scoreubmx  (X,  m)  =  —  [constant  +  (in  +  UxM  )X  (F  —  N (m  +  UxM  )]  (60) 

n  2 

scoreubmx  (X,M)  =  -  [constant  +  (M  +  UxTu  )YFX  (F-|  N(M  +  UxM  )]  (61) 

n  2 


LLR vomU{X,m)  =  -[Gn-M)TI.-\F -NM -NUxM)-Um-M)TY.-lN{m-M)-\  (62) 
n  2 

(5)  “linear”  -  linear  approximation  of  (4) 

By  assuming  that  m-M  is  small,  the  quadratic  tenn  in  (62)  is  dropped,  resulting  in  the 
following  linear  LLR  function: 

LLRlinear(A,m)  =  -{m -M)tYT\F  -  NM  -  NUxM)  (63) 

n 

The  paper  [13]  compares  full  frame-by-frame  scoring  and  the  scoring  methods  (2—5 ) 
here  as  follows: 

•  In  tenns  of  accuracy: 

■  Without  score  nonnalization,  full  frame-by-frame  is  best;  intx,  pointx  and  linear 
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come  close;  ubmx  is  by  far  the  worst. 

■  With  ZT-norm,  full  frame-by-frame,  intx,  pointx  and  linear  are  all  similar,  with 
pointx  and  linear  reaching  the  best  under  some  conditions;  ubmx  is  somewhat 
worse. 

•  In  tenns  of  efficiency:  full  frame-by-frame  is  very  slow;  pointx  is  a  bit  slow;  the 
other  three  are  fast. 

•  Considering  both  aspects,  linear  scoring  seems  to  be  a  best  choice. 

The  “nox”  scoring  method  is  applicable  to  GMM  modeling  without  JFA.  The  linear 
scoring  is  also  applicable,  if  U  =  0  is  substituted  into  Eq.  (63).  The  other  three  scoring 
methods  are  not  applicable  to  GMM  modeling  without  JFA. 

3.4  Score  Normalization 


The  score  of  an  utterance  against  a  speaker  model  measures  their  similarity.  However,  a 
larger  score  doesn’t  always  mean  they’re  more  similar.  For  example,  a  speaker  model  may 
tend  to  yield  higher  scores  than  other  models  for  imposter  utterances.  In  this  case,  even  if  an 
utterance  scores  high  against  this  model,  it  doesn’t  necessarily  mean  that  this  is  the  most 
likely  model.  Also,  an  utterance  may  tend  to  score  higher  than  other  utterances  against 
models  that  do  not  represent  the  speaker  of  this  utterance.  In  this  case,  the  utterance  has  to 
score  even  higher  against  a  speaker  model  to  be  considered  a  match.  In  a  word,  what  matters 
is  not  the  score  itself,  but  rather  how  much  it  stands  out  from  the  scores  of  imposter  trials. 
Unfortunately  the  distribution  of  scores  of  imposter  trials  may  be  different  for  each  model  and 
each  utterance.  This  is  problematic  especially  for  open-set  SID,  because  we  need  to  set  a 
global  threshold  6  for  all  trials.  Score  normalization  is  the  technique  that  normalizes  the 
distribution  of  scores  of  imposter  trials,  so  that  a  fairer  comparison  of  models  could  be  made 
in  closed-set  SID,  and  a  global  threshold  could  be  set  in  open-set  SID. 

Two  common  normalization  techniques  are  zero  nonnalization  (Z-norm)  and  test 
nonnalization  (T-norm).  Z-norm  normalizes  the  scores  by  model: 


score  Znorm(X,m) 


score(X,  m)  -  jum 


(64) 


where  fim  and  am  are  the  mean  and  variance  of  the  distribution  of  scores  of  imposter  trials 
against  the  model  m.  The  imposter  utterances  are  taken  from  the  development  data,  and  the 
parameters  jum  and  <Jm  can  be  pre-computed  in  the  training  stage.  On  the  other  hand,  T-norm 
normalizes  the  scores  by  utterance: 


scoreT  (X,  m)  = 


scor jux 


<7  v 


(65) 


where  //A  and  crv  are  the  mean  and  variance  of  the  distribution  of  scores  of  X  against  a  set  of 
cohort  models.  The  cohort  models  are  trained  from  another  set  of  utterances  taken  from  the 
development  data.  Since  the  parameters  /ux  and  ax  rely  on  the  test  utterance,  they  must  be 
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estimated  in  the  testing  stage. 

The  two  normalization  techniques  can  be  applied  one  after  the  other,  and  the  compound 
normalization  techniques  are  called  ZT-nonn  and  TZ-nonn.  In  ZT-norm,  we  first  normalize 
the  scores  (of  both  target  models  and  cohort  models)  with  Z-nonn,  and  then  normalize  the 
resulting  scores  with  T-nonn.  In  TZ-nonn,  we  first  normalize  the  scores  (of  both  the  trial 
utterances  and  the  development  imposter  utterances)  with  T-norm,  and  then  normalize  the 
resulting  scores  with  Z-nonn.  In  order  to  ensure  that  the  scores  used  as  references  are  indeed 
from  imposter  trials,  the  cohort  set  and  the  development  imposter  set  must  not  overlap. 

We  studied  the  effect  of  score  normalization  with  open-set  SID  experiments  on  the 
ROSSI  database.  The  numbers  are  given  in  Section  4.2.  We  found  that  T-nonn  improved  the 
open-set  perfonnance  (while  not  affecting  closed-set  performance),  while  the  other 
normalizations  decreased  the  perfonnance  most  of  the  time. 

4.  RESULTS  AND  DISCUSSION 

4.1  Closed-Set  SID  Experiments 

We  compare  the  perfonnance  of  the  most  of  the  acoustic  features  and 
pre-/post-processing  techniques  by  the  closed-set  accuracy  on  the  ROSSI  database.  The 
numbers  are  shown  in  Table  2. 

The  naming  convention  is  as  follows:  The  core  part  of  the  feature  name  is  the  feature  type 
and  its  dimensionality,  for  example,  “MFCC21”  means  21 -dimensional  MFCC  feature.  If  the 
dimensionality  is  doubled  (e.g.  MFCC42),  then  it  means  delta  features  are  appended. 
Pre-/post-processing  techniques  are  attached  to  the  feature  name  as  prefixes  or  suffixes. 
“PRE”  stands  for  pre-emphasis,  “LP”  stands  for  linear  prediction,  “CMN”  stands  for  cepstral 
mean  normalization,  and  “GAUSS”  stands  for  short-time  Gaussianization. 

From  the  table  we  can  see  that  the  features  and  pre-/post-processing  techniques  that 
improve  the  closed-set  accuracy  include: 

•  Multitaper  MFCC  (especially  using  8  multipeak  tapers); 

•  DSCC; 

•  Pre-emphasis  and  delta  features; 

•  Short-time  Gaussianization. 


Table  2.  Performance  of  Various  Acoustic  Features  and  Pre-/Post-Processing 
Techniques,  Compared  by  Closed-Set  Accuracy  on  the  ROSSI  Database 


Feature 

Set  ID 

1 

2 

3 

4 

5 

6 

7 

8 

Average 

MFCC21CMN  (Baseline) 

89 

72 

63 

43 

41 

40 

65 

61 

59.250 

MFCC42  CMN 

86 

75 

66 

43 

44 

48 

65 

62 

61.125 

PRE  MFCC21  CMN 

91 

75 

62 

46 

41 

38 

67 

60 

60.000 

PRE  MFCC42  CMN 

92 

77 

65 

48 

48 

46 

70 

61 

63.375 
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WMVDR21  CMN 

88 

74 

64 

43 

40 

40 

65 

62 

59.500 

SCF20 

63 

58 

44 

17 

17 

15 

44 

43 

37.625 

SCF20PCA 

65 

64 

50 

20 

21 

15 

47 

46 

41.000 

MFCC21  CMN  +  SCF20 

75 

68 

56 

30 

24 

26 

57 

50 

48.250 

MFCC21  CMN  +  SCF20PCA 

83 

68 

60 

33 

26 

23 

59 

54 

50.750 

MFCC21  CMN  + 

DSCC20GAUSS100CMN 

(Compared  with  MFCC42  CMN) 

82 

78 

69 

48 

46 

47 

66 

64 

62.500 

Multitaper 

MFCC 

Baseline  (1  Hamming) 

-  our  implementation 

89 

72 

63 

43 

41 

40 

65 

61 

59.250 

Baseline  (1  Hamming) 

-  Kinnunen’s 
implementation 

88 

71 

64 

50 

40 

40 

63 

61 

59.625 

4  Thompson 

86 

70 

66 

49 

44 

38 

67 

63 

60.375 

8  Multipeak 

90 

75 

68 

50 

48 

39 

64 

62 

62.000 

8  SWCE 

91 

75 

65 

46 

48 

34 

67 

62 

61.000 

LP  MFCC21  CMN 

(Param  =  LP  order) 

1 

86 

72 

63 

43 

40 

38 

65 

61 

58.500 

2 

86 

76 

64 

47 

41 

39 

65 

59 

59.625 

5 

85 

75 

65 

41 

31 

35 

59 

62 

56.625 

10 

83 

73 

68 

37 

33 

30 

59 

57 

55.000 

15 

81 

72 

65 

37 

34 

31 

60 

58 

54.750 

20 

81 

71 

64 

38 

33 

30 

59 

57 

54.125 

MFCC21  GAUSS 

(Param  =  window  length  in  frames) 

100 

86 

74 

64 

51 

42 

42 

61 

64 

60.500 

200 

88 

76 

65 

55 

43 

43 

62 

65 

62.125 

300 

85 

76 

66 

51 

45 

43 

64 

64 

61.750 

400 

86 

75 

66 

52 

44 

46 

66 

62 

62.125 

500 

85 

76 

67 

54 

45 

44 

68 

64 

62.875 

PRE  MFCC42  GAUSS 

(Param  =  window  length  in  frames) 
(Compared  with  PRE  MFCC21  CMN 
+  DELTA) 

100 

89 

75 

72 

54 

43 

49 

66 

64 

64.000 

200 

93 

75 

72 

55 

45 

50 

66 

67 

65.375 

300 

91 

76 

70 

58 

45 

48 

67 

65 

65.000 

400 

93 

78 

68 

56 

46 

50 

68 

65 

65.500 

500 

92 

76 

68 

58 

45 

47 

66 

64 

64.500 

For  the  prosodic  features  HSCC  and  FFV,  and  their  fusion  with  the  baseline  MFCC,  we 
evaluated  their  performance  on  the  four  scenarios  on  the  MIXER5  database,  also  using  the 
closed-set  accuracy  as  the  criterion.  The  numbers  are  shown  in  Table  3,  and  the  change  of  the 
perfonnance  with  the  fusion  weights  is  visualized  in  Figure  6.  The  improvement  of  the  fusion 
over  baseline  MFCC  is  minor,  and  the  fusion  weights  are  hard  to  explain:  in  three  of  the  four 
scenarios,  FFV,  which  performs  worst  alone,  has  (one  of)  the  highest  weights. 
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Table  3.  Performance  of  HSCC,  FFV,  and  Their  Fusion  with  MFCC,  Compared  by 
Closed-Set  Accuracy  on  the  MIXER5  Database 


VB-YB 

VL-YL 

VB-YL 

VL-YB 

MFCC21 

70.43 

74.71 

59.29 

55.43 

HSCC20 

38.43 

44.14 

23.00 

23.00 

FFV7 

32.29 

36.57 

10.29 

13.71 

Best  fusion 

74.71 

79.86 

61.71 

56.86 

Best  fusion 

0.39  +  0.03  + 

0.35  +  0.05  + 

0.47  +  0.06  + 

0.87  +  0.04  + 

weights 

0.58 

0.60 

0.47 

0.09 

VB-YB 


VL-YL 
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Figure  6.  Fusion  of  MFCC,  HSCC  and  FFV:  How  the  Closed-Set  Accuracy  Changes 

with  the  Fusion  Weights 


4.2  Open-Set  SID  Experiments 

We  conducted  extensive  open-set  SID  experiments  on  the  ROSSI  database.  The  variables 
we  considered  include: 

•  Acoustic  features:  six  types  of  features  that  investigate  the  effect  of  pre-emphasis, 
delta  feature,  short-time  Gaussianization,  and  multitapers.  For  short-time 
Gaussianization,  we  chose  a  window  length  of  400  frames  which  has  been  shown  to 
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perform  best  for  closed-set  SID.  For  multitapers,  we  used  8  Multipeak  tapers. 

•  Speaker  modeling  method  and  kernel  function:  we  compared  GMM  speaker 
modeling  and  SVM  speaker  modeling;  for  the  latter,  we  considered  all  the  six  kernel 
functions. 

•  Score  normalization:  No  nonnalization,  Z-norm,  T-nonn,  ZT-norm  and  TZ-norm. 

For  each  combination  of  variables,  we  measured  the  4  criteria  for  each  of  the  8  sets  of  the 

ROSSI  database.  This  yielded  a  total  of  6x7x5x8x4  =  6720  numbers,  which  were  too 
many  to  analyze.  Therefore  we  aggregated  the  numbers  in  the  following  ways: 

•  The  numbers  across  the  8  sets  are  averaged; 

•  For  the  six  kernel  functions,  only  the  best  is  reported  along  with  the  name  of  that 
kernel. 

Now  there  are  6x2x5x4  =  240  numbers,  which  is  still  quite  a  lot  but  can  be  analyzed.  These 
numbers  are  listed  in  Figure  7. 

Figure  7  is  made  up  of  five  rows  and  two  columns,  where  each  row  stands  for  a  type  of 
score  nonnalization  and  the  two  columns  stand  for  the  two  scoring  methods.  Each  small  table 
has  five  rows  corresponding  to  features  and  four  columns  corresponding  to  the  evaluation 
criteria.  In  the  SVM  modeling  part,  the  name  of  the  best  kernel  is  shown  alongside  the 
numbers.  The  greener  the  background  is,  the  larger  the  number.  The  maximum  number  in 
each  column  of  the  small  tables  is  highlighted  in  red. 

The  following  conclusions  can  be  obtained  concerning  the  variables  we’re  interested  in: 

•  Scoring  method:  SVM  speaker  modeling  is  significantly  better  than  GMM 
modeling. 

•  Score  normalization:  T-nonn  improves  the  open-set  performance  (while  not 
affecting  closed-set  performance),  while  the  other  nonnalizations  decrease  the 
performance  most  of  the  time. 

•  Acoustic  feature:  The  best  feature  for  closed-set  SID  is 

PRE_PEAK8_MFCC40_GAUSS400  (with  every  technique  included).  The  best 
feature  for  open-set  SID,  if  we  focus  on  the  T-nonn,  is  PREMFCC42GAUSS400. 
Multitapers  do  not  seem  to  work  well  for  open-set  SID.  For  other  score 
normalizations  the  conclusion  may  be  a  bit  different,  but  there’s  no  doubt  that  the 
pre-emphasis  and  delta  features  should  be  included. 

•  Kernel  function:  The  numbers  favor  the  WBHATT  kernel  for  closed-set  SID,  and 
the  BHATT  or  GUMI  kernel  for  open-set  SID.  Experiments  that  compare  the 
performances  of  different  kernel  functions  show  that  the  PLAIN  and  KL  kernels  also 
achieve  a  similar  performance  with  negligible  differences,  but  the  L2  kernel 
performs  much  worse  and  can  be  eliminated. 
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GMM-UBM,  No-norm 


GMM-SVM,  No-norm 


MFCC21_CMN 

59.25 

55.65 

65.75 

69.30 

- 

P  R  E_M  FC  C42_C  M  N 

63.38 

59.40 

68.75 

71.59 

- 

M  FC  C2 1 _G AU  S  S400 

62.12 

58.80 

67.75 

71.10 

- 

P  R  E_M  FC  C42_G  All  S  S400 

65.50 

60.24 

68.38 

71.49 

- 

P  E  AK8_M  FC  C20_C  M  N 

62.00 

55.40 

64.25 

68.56 

- 

P  R  E  PEAK8  MFCC40  GAU  S  S400 

66,75 

58,11 

66,50 

70^0 

- 

Closed  accu. 

Open  accu. 

100- EER 

P0.5 

GMM-UBM,  2-norm 

MFCC21_CMN 

57.00 

55.01 

67.38 

69.27 

- 

P  R  E_M  FC  C42_C  M  N 

59.75 

57.51 

68.12 

72.22 

- 

M  FC  C2 1 _G AU  S  S400 

57.50 

55.34 

65.38 

69.69 

- 

P  R  E_M  FC  C42_G AU  S  S400 

60.25 

58.57 

67.25 

71.91 

- 

P  EAK8_M  FC  C20_C  M  N 

58.00 

54.22 

64.62 

68.98 

- 

P  R  E  PEAK8  MFCC40  GAU  S  S400 

61 ,75 

57  £5 

67,12 

72,59 

- 

Closed  accu. 

Open  accu. 

100- EER 

F0.5 

GMM-UBM,  T-norm 

MFCC21_CMN 
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57.40 

67.25 
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- 
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61.08 

68.88 

72.46 

- 
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72.12 

- 
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- 
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- 
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- 

Closed  accu. 

Open  accu. 

100- EER 

F0.5 

GMM-UBM,  ZT-norm 

MFCC21_CMN 
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54.95 

64  j  2 

67.83 

- 

P  R  E_M  FC  C42_C  M  N 
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67.88 

71.47 

- 
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54.32 
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68.23 

- 
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54.83 

65.38 

70.27 

- 
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55.08 

65.62 

69.75 

- 
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61 ,75 

57,37 
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- 
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GMM-UBM,  TZ-norm 
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69.66 

- 

PRE_MFCC42_CMN 

59.38 
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- 

M  FC  C2 1 _G AU  S  S400 

58.62 

55.05 

64.00 
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- 
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- 
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- 
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- 
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- 
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•  GUMI  67.88 
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Figure  7.  Open-Set  Evaluation  Results  on  the  ROSSI  Database 


4.3  Experiments  with  JFA 

4.3.1  Toolkits  Used  in  Our  Experiments  of  JFA 

We  implemented  everything  except  frame-by-frame  scoring  in  Matlab.  Because  our 
results  didn’t  turn  out  good  for  a  long  time,  we  also  used  two  toolkits  for  comparison:  Janus 
and  Alize.  We  compared  the  part  of  the  speaker  recognition  pipeline  after  calculating  the 
Baum-Welch  statistics,  including  the  initialization  and  training  of  the  V,  U,  D  matrices, 
speaker  enrollment,  and  scoring. 

Janus  refers  to  the  JFA  functionality  built  into  the  “Janus”  speech  recognition  software  of 
our  lab,  written  in  C++.  It  initializes  the  JFA  matrices  at  the  order  of  magnitude  of  1,  but  also 
supports  custom  initialization.  The  training  order  of  the  matrices  seems  to  be  flexible.  The 
enrollment  and  scoring  parts  are  too  slow  to  be  used. 

Alize  [19]  is  an  open  source  platfonn  for  biometric  authentication  developed  by  the 
University  of  Avignon,  including  the  functionality  of  JFA.  It  is  also  written  in  C++.  Like 
Janus,  Alize  initializes  JFA  matrices  at  the  order  of  magnitude  of  1  by  default  and  supports 
custom  initialization.  It  must  train  the  JFA  matrices  in  the  order  of  V-U-D.  In  speaker 
enrollment,  Alize  first  estimates  the  vectors  y  and  x  jointly,  then  estimates  the  z  vector.  The 
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scoring  methods  offered  by  Alize  include  frame-by-frame  scoring  (either  full  or  fast)  and 
linear  scoring. 

4.3.2  JFA  Experiment  Setup  and  Results 

In  our  experiments,  we  used  our  own  initialization  of  the  JFA  matrices  (which  has  been 
found  to  be  superior  to  the  default  initialization  of  Janus  and  Alize).  We  compared  the  training 
of  the  JFA  matrices  of  our  own  code  and  the  two  toolkits  ( V-D-U  with  Matlab  and  Janus,  V- 
-U-D  with  Alize).  For  speaker  enrollment,  we  used  our  own  code  of  joint  estimation  of  y,  x 
and  z.  The  two-step  estimation  of  Alize  doesn’t  exhibit  much  difference  from  joint  estimation. 
For  scoring,  we  used  Alize’s  fast  frame-by-frame  scoring  (choosing  5  top  components)  and 
our  implementation  of  the  five  scoring  methods  in  Section  3.3.3.  It  has  been  verified  that 
Alize’s  implementation  of  linear  scoring  produces  exactly  the  same  results  as  our 
implementation.  We  also  implemented  SVM  modeling  and  scoring. 

Because  we  didn’t  get  good  results,  and  because  the  results  seem  to  be  sensitive  to  many 
variables,  we  used  only  the  PLAIN  kernel  for  SVM  modeling  and  scoring,  and  didn’t  apply 
any  score  normalization. 

Below  are  our  latest  results  (EER)  obtained  on  the  sixth  subset  (telephone  train, 
telephone  test)  of  the  male  trials  of  the  NIST  2008  evaluation.  The  acoustic  feature  used  is 
RM40. 


Table  4.  Comparison  of  EER  of  various  training  methods  of  JFA  matrices  and  scoring 

methods 

on  the  sixth  subset  of  NIST  2008  male  trials 


Relevance 

MAP 

JFA  with 

Matlab 

JFA  with 

Janus 

JFA  with 

Alize 

Fast  frame-by-frame 

14.76 

15.66 

14.32 

15.90 

nox 

14.12 

15.38 

15.10 

15.21 

intx 

N/A 

15.34 

11.85 

15.15 

pointx 

15.34 

11.86 

15.16 

ubmx 

15.33 

16.70 

15.22 

linear 

12.93 

16.65 

12.47 

16.70 

SVM  (PLAIN 
kernel) 

9.77 

11.78 

11.78 

11.72 

It  appears  that  among  the  three  implementations  of  JFA,  only  Janus  shows  an 
improvement  over  the  relevance  MAP  baseline  (except  the  case  with  SVM).  Also,  the  trend 
of  the  perfonnance  of  the  intx,  pointx,  ubmx  and  linear  scoring  methods  agrees  with  those  in 
[13].  However,  when  we  investigated  what  was  the  difference  of  the  Janus  implementation 
from  the  others,  we  found  that  the  U  matrix  wasn’t  correctly  trained  by  Janus:  98  out  of  the 
100  columns  were  almost  identical.  We  tried  running  the  enrollment  and  scoring  with  only 
some  of  the  matrices  V,  U,  D,  and  found  that  it  was  exactly  the  “wrong”  U  matrix  that 
produced  the  low  error  rates.  We  tried  training  the  U  (or  D)  matrix  directly  after  training  the  V 
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matrix  with  the  three  implementations,  and  found  that  Janus  produced  different  results  from 
the  other  two.  The  D  matrix  trained  by  Janus  even  contained  negative  values  despite  the 
all-positive  initialization.  This  seems  to  be  evidence  that  the  Janus  implementation  is 
probably  wrong,  and  the  good  numbers  we  obtained  couldn’t  be  explained.  Also,  when  SVM 
modeling  and  scoring  are  used,  we  never  found  a  configuration  that  beat  the  relevance  MAP 
baseline.  We  suspect  that  the  following  two  factors  might  relate  to  the  no  success  of  JFA: 

1 .  The  database  used,  and  the  division  of  the  data.  The  NIST  database  is  huge,  and  its 
organization  is  complicated.  It  might  take  pages  to  exactly  specify  what  data  one  used  for 
what  part  of  the  experiments,  and  sometimes  one  may  even  have  to  provide  the  file  lists.  As  a 
result,  we  haven’t  been  able  to  replicate  the  data  division  of  any  literature,  and  it  is  possible 
that  we  have  run  all  our  experiments  on  a  not  so  reasonable  division  of  the  data. 

2.  Lack  of  a  working  configuration  of  JFA.  Although  we  had  access  to  the  source  code  of 
two  toolkits,  the  source  code  only  contained  functions  that  perform  the  individual  units  of 
functionality  in  JFA,  and  we  had  to  configure  the  pipeline  (e.g.  order  of  training  the  matrices) 
on  ourselves.  In  doing  so  we  naturally  tended  to  imitate  our  own  configuration,  and  if  our  own 
configuration  had  been  wrong,  the  toolkits  wouldn’t  have  helped  us  discover  our  mistake. 
Even  though  the  toolkits  can  be  useful  for  debugging  the  implementation,  we  might  still  have 
missed  the  “correct”  configuration. 

5.  CONCLUSION 

During  the  performance  period  of  this  project,  we  have  worked  on  all  the  modules  of 
speaker  recognition  systems.  At  the  front  end,  we  have  studied  a  variety  of  acoustic  features 
and  pre-/post-processing  techniques,  and  have  come  up  with  a  PPMD  feature  that  combines 
the  benefits  of  multitaper  MFCC,  DSCC,  pre-emphasis,  and  short-time  feature 
Gaussianization.  At  the  speaker  modeling  and  scoring  stages,  we  have  also  investigated  SVM 
and  JFA.  We  have  demonstrated  that  compared  to  GMM  modeling,  SVM  modeling  and 
scoring  are  not  only  better  and  also  faster.  We  have  also  shown  that  T-norm  of  the  scores 
improves  open-set  speaker  identification  performance  on  the  ROSSI  database. 
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A.  DERIVATION  OF  GMM  KERNELS 

In  this  appendix  we  give  the  derivation  of  the  six  GMM  kernel  functions  (PLAIN,  GUMI, 
KL,  L2,  BHATT,  WBHATT).  The  motivation  of  these  kernels  can  be  divided  into  three 
categories: 

•  Defining  a  “supervector”  for  each  GMM,  and  using  the  inner  product  of  the 
supervectors  as  the  inner  product  of  the  GMMs  (PLAIN); 

•  Defining  a  “distance”  measure  between  GMMs,  and  deriving  an  inner  product  from 
the  distance  (GUMI,  KL); 

•  Directly  defining  an  “inner  product”  on  the  probability  density  functions  of  GMMs 
(L2,  BHATT,  WBHATT). 

In  the  following  discussion,  we  only  consider  the  adaptation  of  GMM  means.  The 
covariances  and  component  weights  of  the  adapted  GMMs  are  identical  to  those  of  the  UBM. 

We  shall  denote  the  UBM  by  M  =  ( //, ,  £ vv; ) ,  and  the  adapted  GMMs  by  Ma  =  (//“,£., ma) 
and  Mb  =  (//* ,2.,w,.). 

A.l  The  PLAIN  kernel 

The  first  way  of  defining  a  GMM  kernel  is  to  define  a  supervector  for  each  GMM.  The 
PLAIN  kernel  constructs  the  supervector  by  concatenating  all  the  component  means.  The 
resulting  kernel  function  is: 

=  (A-l) 

This  kernel  is  simple,  but  one  might  criticize  it  for  not  taking  into  account  the  covariance 
matrices.  For  example,  it  is  reasonable  to  introduce  the  covariance  matrices  in  a  way  similar 
to  the  Mahalanobis  distance: 

=  (A-2) 

As  it  turns  out,  this  is  the  GUMI  kernel  to  be  discussed  next. 

A.2  The  GUMI  kernel  [A-l] 

The  second  way  of  defining  a  GMM  kernel  is  to  derive  it  from  a  distance  measure.  The 
distance  d  associated  with  an  inner  product  /•,•}  satisfies  d2(a,b)  =  (a-b,a-b)  .  If  the 
squared  distance  takes  on  a  form  in  which  the  difference  of  some  function  of  the  GMM 
models  / ( Ma)-f(Mh )  occurs  twice  in  a  multiplicative  relationship,  then  we  can  define  the 

replace  them  with  f(Ma )  and  f(Mb)  respectively  to  define  a  kernel  function.  The  GUMI 
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and  KL  kernels  are  two  examples. 

The  GUMI  (GMM-UBM  mean  interval)  kernel  is  derived  from  the  Bhattacharyya 
distance  of  two  probability  distributions.  The  Bhattacharyya  distance  between  two 

probability  distributions  pa(x )  and  ph(x)  is  defined  as: 


^Bhat,  (Pa  >  Pb)  =  ~ ln  J  -jpM)Ph(x)  d* 


(A-3) 


For  two  Gaussian  distribution  (not  GMMs)  Na (//" , I" )  and  Nh(p'\Y.h) ,  the  Bhattacharyya 
distance  is: 


(AT. , N„)  =  i (//“  - / )r (^-r‘ (/<■-/) 


(A-4) 


When  the  two  distributions  have  identical  covariance  matrices  (  =  Hh  =  E  ),  the 

Bhattacharyya  distance  reduces  to: 

AMW,JV»)A(A'-/)rZ'V  V)  (A-5) 

o 

For  GMMs,  however,  it  is  hard  to  calculate  the  Bhattacharyya  distance.  Nevertheless,  the 
effect  of  GMM  adaptation  manifests  itself  in  a  shift  of  the  means  of  the  individual 
components,  so  it  still  makes  sense  if  we  measure  the  distance  between  corresponding 
components.  A  GUMI  kernel,  therefore,  is  defined  for  two  Gaussian  distributions  by 
modifying  Eq.  (A-5)  and  dropping  the  constant: 

KGUMl(Na,Nb)  =  (<uayT.-\<ub)  (A-6) 


And  the  GUMI  kernel  for  GMMs  is  then  defined  as  the  sum  of  Eq.  (A-6)  over  all  components: 

^OUM,  (M. ,  M4  )  =  £,(/<,'  )rU  (rt‘ )  (A-7) 

(In  the  original  literature  the  GUMI  kernel  is  defined  as 

*0  =  (A-8) 

But  since  the  SVM  is  shift-invariant,  the  subtraction  of  the  UBM  means  can  be  dropped.) 


A.3  The  KL  kernel  [A-l][A-2] 


The  KL  kernel  is  based  on  the  KL  divergence  of  two  probability  distributions.  The  KL 
divergence  from  a  probability  distribution  pa{x)  to  another  probability  distribution  ph(x)  is 
defined  as: 


DKl(Pu  \\Pb)  =  \Pa  (*)  l0g 

J  n  i  vi 


Pb(x) 


(A-9) 


Note  that  this  is  an  asymmetric  function.  For  Gaussian  distributions  Ay (/./",!")  and 
Nb(jub ,Tjb)  ,  the  KL  divergence  is: 
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(A- 10) 


Ou.w,ii  n„) =|(/i“ -/)r(z‘r‘(/i“ -//) 

Note  the  asymmetry:  the  KL  divergence  depends  on  the  covariance  X*  but  not  X" .  However, 
when  the  two  distributions  have  identical  covariances,  the  KL  divergence  is  reduced  to: 

£>K1(A'J|lVs)  =  i(//--/)rE-,(//“-/)  (A- 1 1 ) 

which  is  essentially  the  same  as  the  Bhattacharyya  distance  (Eq.  (A-5)). 

From  this  point  on  we  may  derive  the  same  kernel  function  as  the  GUMI  kernel.  But  the 
authors  of  [A-2]  didn’t  want  to  jump  from  the  KL  divergence  of  GMMs  to  the  sum  of  KL 
divergences  of  individual  components  directly.  Instead,  they  worked  out  an  upper  bound  of 
the  KL  divergence  of  GMMs  using  the  log-sum  inequality: 

V  ,  •  »VY-V  .  A  (A'12) 

=  £,,WM  )  £,  (ft,  -ft,  > 

where  M“  and  Mb.  stand  for  the  /-th  component  of  the  two  GMMs.  And  from  this  they 
derived  the  KL  kernel  function: 

*KL  (M. ,  M„)  =  Y,, «,  (ft,“)r  £■'  (ft* )  (A- 1 3 ) 

It  turns  out  that  the  sole  difference  between  the  GUMI  and  KL  kernel  is  whether  the 
components  are  weighted.  There  is  some  approximation  involved  in  the  derivation  of  both 
kernels,  so  there  is  no  theoretical  justification  for  either  of  them.  The  paper  [A-l]  says  that  “in 
various  signal  selection  problems,  the  Bhattacharyya  distance  has  shown  to  give  better  results 
than  the  KL  divergence,”  but  we  still  need  to  do  experiments  with  our  own  data  to  see  the 
difference  in  perfonnance. 


A.4  The  L2  kernel  [A-2] 


The  third  way  to  define  a  GMM  kernel  function  is  to  treat  GMMs  as  continuous 
probability  density  functions,  and  use  a  general  inner  product  for  continuous  functions.  The 
L2  kernel  makes  use  of  the  L2  inner  product  of  real  continuous  functions: 

{f(*),g(x))L2  =  J  f(x)g(x)dx  (A- 14) 


For  two  Gaussian  distributions  Na(jua  ,'La)  and  Nb(jub ,Xfe)  in  a  D-dimensional  space,  the 
inner  product  is: 


{N„Nl)l2=N(M‘-M>\0,r+Z>) 


(In')011  Xa+Xfc 


m 


exp 


(jua  -jub)T(I.a  +Zby\jua  -/ub) 


(A- 15) 


With  identical  covariance  matrices,  the  inner  product  is  reduced  to: 
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(A- 16) 


{Na ,  Nb  >L2  =  N{fj.a  -  f/  |  0, 2Z) 

1 


nvn 

'  W-S)T S'V-//*)" 

Z 

1/2 

4 

(4  tt)DI2' 

The  inner  product  of  two  GMMs  is: 

{M„Mb)n  <A"  17) 

Under  the  assumption  that  non-corresponding  components  of  the  two  GMMs  are  situated  far 
away  so  that  they  do  not  make  significant  contribution  to  the  inner  product,  this  is 

approximated  to  yield  the  L2  kernel  function  (dropping  the  constant  term  (47t)/j  2 : 


L2 


-z, 


W: 


1 1/2 


exp 


(A- 18) 


The  L2  kernel  is  somewhat  to  the  radial  basis  kernel  function,  because  it  puts  the  quadratic 
form  in  an  exponential  function. 


A.5  The  BHATT  and  WBHATT  kernels 


The  radial  basis  kernel  function  has  a  nice  property:  because  of  the  positive  semidefinity 
of  the  E_1  matrix,  the  exponential  term  is  bounded  in  [0,1].  However,  in  the  L2  kernel,  the 

Z  term  breaks  this  bound.  Is  it  possible  to  eliminate  the  latter  tenn?  The  answer  is  yes  -  if 

we  use  the  Bhattacharyya  inner  product  instead  of  the  L2  inner  product. 

In  the  definition  of  the  Bhattacharyya  distance  (Eq.  (A-3)),  the  part  inside  the  logarithm 
is  also  a  valid  inner  product  for  two  continuous  functions: 

(/(*),  g(X>}Bhatt  =  J  <Jf(x)g(x)  dx  (A- 19) 

Because  the  geometric  mean  is  smaller  than  or  equal  to  the  arithmetic  mean,  for  two 
probability  distributions  (whose  integral  is  1),  the  Bhattacharyya  inner  product  is  bounded  in 

[0,1].  The  Bhattacharyya  inner  product  for  two  Gaussians  (//",!)  and  Nb{/uh ,Y)  with 


identical  covariance  matrices  is: 


A>A'»>Bh.«=exp 


h//1  Vfz- W) 

o 


(A-20) 


Similar  to  the  derivation  of  the  L2  kernel,  we  can  derive  the  weighted  Bhattacharyya 
(WBHATT)  kernel  for  GMMs: 


^WBHATT  (M a  ’  ^ b  )  ^  ,•  Wi  eXP 


(//; 

8 


(A-21) 


Just  like  the  GUMI  kernel,  one  could  argue  that  the  weighting  of  components  is  unnecessary, 
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(A-22) 


and  thus  we  have  the  unweighted  Bhattacharyya  (BHATT)  kernel: 

8 

Both  the  BHATT  and  WBHATT  kernels  are  bounded,  which  is  a  major  difference  from  the 
other  kernels. 

[A-l]  C.  H.  You,  K.  A.  Lee  and  H.  Li,  “GMM-SVM  kernel  with  a  Bhattacharyya-based 
distance  for  speaker  recognition,”  IEEE  Transactions  on  Audio,  Speech  and 
Language  Processing,  vol.  18,  no.  6,  pp.  1310-1312,  Aug  2010. 

[A-2]  W.  Campbell,  D.  Sturim  and  D.  Reynolds,  “Support  vector  machines  using  GMM 
supervectors  for  speaker  verification,”  Signal  Processing  Letters,  2006. 


-^BHATT  (Ma,Mb)  =  ^(exP 
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B.  DERIVATION  OF  MINIMUM  DIVERGENCE  ESTIMATION 


First  we  take  a  quick  look  at  maximum  likelihood  estimation.  The  objective  function  is 

Z(F)  =  I>  gP(X(s)\V)  (B-l) 

By  choosing  y(.s)  as  the  latent  variable  (which  is  a  natural  choice),  we  construct  an  auxiliary 
function: 

Q(V,V')  =  constant  +  ^  [  P(y(s)  \X(s),  V)  log  P(X (s),  y (5)  |  V ') 

s  vh) 

=  constant  +  ^  [  P(y(s)  \  X(s),V)logP(X(s)  |  MXbO 

i  V(A') 

Note  the  change  in  the  tenn  within  the  logarithm.  This  happened  because  (dropping  the 
argument  s) 

log P(X, y\V')  =  logP(X\y,V')  +  log P(y  V)  (B-3) 

But  the  (prior)  distribution  ofy  doesn’t  depend  on  V,  so  the  second  term  gets  absorbed  by  the 
“constant”  term  (remember  that  we’re  going  to  maximize  Q  in  terms  of  V  only). 

Now  we  make  an  “unnatural”  choice  about  the  latent  variable  -  we  choose  the  entire  Vy  , 

and  denote  it  by  a.  By  imitating  Eq.  (40),  we  can  write  out  the  auxiliary  function: 

Q  ’(V,  V ')  =  constant  +  ^  J  P(a  \  X,  V )  log  P(X,  a\V')  da  (B-4) 

s  a 

The  meaning  of  this  fonnula  isn’t  obvious  at  first  sight.  First,  the  range  of  a  doesn’t  cover  the 
entire  CF-dimensional  space,  so  the  probability  density  function  P(a  \  X,  V )  isn’t 

well-defined.  This  doesn’t  matter,  because  we  can  treat  the  entire  [  P(a  |  X,  V )[■]  da  part  as 

J  a 

an  expectation  operator  EaX  v  over  the  posterior  distribution  of  a: 

Q'(V,V')  =  constant  +  '^jEa]x  v\logP(X,ci  \  V ')]  (B-5) 

Next  let’s  study  the  tenn  within  the  logarithm.  We  can  decompose  it  in  a  similar  way  to  Eq. 
(B-3): 

log  P(X,  a  |  V ')  =  log  P(X  |  a,  V ')  +  log  P(a  \  V  ’)  (B-6) 

Now  look  at  the  first  tenn:  when  a  is  given,  the  distribution  ofXis  detennined,  and  V  has  no 
effect!  Therefore  this  time  it  is  the  first  tenn  that  gets  absorbed  by  the  “constant”  term.  The 
auxiliary  function  becomes 

Q'(V,V)  =  constant  +  ^  Ealx  v  [log  P(a  \  V  ’)]  (B-7) 

If  the  range  of  V  were  different  from  that  of  V,  then  log  P(X,a  \  V')  will  be  zero  almost 
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everywhere  over  the  posterior  distribution  of  a,  which  is  meaningless.  It  is  required  that  the 
range  of  F  be  identical  to  that  of  V,  and  this  is  why  minimum  divergence  doesn’t  change  the 
eigenvoice  space. 

Given  that  the  columns  of  V  and  F  span  the  same  subspace,  we  can  let  V  =  VJ ,  where  J 
is  a  small  invertible  matrix.  Let’s  think  about  what  P(a  \  V')  means:  it  means  that  a  variable 
y\  whose  prior  distribution  is  the  standard  normal  distribution,  takes  a  value  such  that 
V  y '  =  a  =  Vy ,  which  means  y  =  Jy ' .  So,  if  we  drop  the  notation  of  a  and  switch  to  v,  the 
auxiliary  function  will  become 

Q'(V,V)  =  constant  +  £  L^^log  P(y  =  Jy')]  (B-8) 


This  is  a  cross-entropy  from  the  posterior  distribution  of  y  to  the  distribution  of  Jy' .  If  we  add 


the  entropy  of  the  posterior  distribution  of  y  itself  (anyway  it  doesn’t  depend  on  F),  the 
auxiliary  function  becomes 


Q  '(V,  V')  =  constant  +  ^  Ey]XV 

S 


log  P{y  =  Jy') 
\ogP{y\X,V) 


(B-9) 


which  is  then  the  negative  of  the  KL-divergence  between  the  two  aforementioned  distribution. 
Remember  that  the  prior  distribution  of  v’  is  the  standard  nonnal  distribution.  Maximizing  the 
auxiliary  function  is  equivalent  to  minimizing  (the  sum  across  all  speakers  of)  the 

KL-divergence  from  the  posterior  distribution  of  y  to  the  distribution  of  Jy '  -  this  is  why  this 


updating  procedure  is  called  “minimum  divergence  estimation”. 

In  Section  3.3. 1.4  we  have  derived  that  the  posterior  distribution  of y  is  (see  Eq.  (41)): 


N  (E1(s),G(s)) 


(B-10) 


And  given  that  y  ~  N  (0,7) ,  the  distribution  of  Jy'  is 

N  (0 ,JJT)  (B- 11) 

The  KL-divergence  from  a  /.-dimensional  Gaussian  distribution  N0(//0,Z0)  to  another 


Gaussian  distribution  N  1(//1,21) ,  according  to  Wikipedia,  is: 


Aa(N0IIN,)  =  i 


tr 


( 2rl2:o )  +  ( A  -  Ao  )r  (A  -A))-  log 


^detL0A 

detL1 


-k 


(B-12) 


(The  sum  of)  the  KL-divergence  to  be  minimized,  dropping  the  terms  that  don’t  depend  on  F, 
is 


^Dkl=  y  )-‘  G(^)]  +  £-1 C^)7-  ( JL7r  )-‘  ^  (^)  +  lQg  cietc  ) )  (B-13) 

^  s 
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Let  K  =  (JJ  T )  1 ,  and  we  want  to  find  the  matrix  K  that  minimizes  this  sum  of  KL-divergence. 

We  will  take  its  derivative  to  K.  Here  are  some  formulas  of  matrix  calculus  that  come  in 
handy: 

—  tr  (ABr)  =  B  —(xtAv)  =  xvt  —detA  =  (adjA)T=detA-(A1)r(B-\4) 

dA  8A  8A 

where  adj  A  is  the  adjugate  of  A.  Also,  since  the  matrices  we’re  dealing  with  are  all 
symmetric,  we  don’t  need  to  care  about  the  transposes. 

The  derivative  of  ID,.,  w.r.t  K  is 


_8_ 

8K 


KL 


]__8_ 
2  8K 


^  ( tr[KG(s)]  +  El  ( s)T  KEX  (5)  -  log  det  K  j 


G(s)  +  El(s)El(Sy  - 


T  det  KK 


S  V 


det  K 


-1  x 

) 


\YXe^~kA 


Set  this  derivative  to  zero,  and  we’ll  have 


K-'  =JJT 

&  s 


(B-15) 


(B-16) 


By  Cholesky  decomposition  we  can  find  J,  and  update  V  by  V  <—  VJ .  The  update  fonnulas 
for  D  and  U  are  similar. 
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C.  JFA  SPEAKER  ENROLMENT:  JOINT  ESTIMATION  OF  j t,y,  z 

Analogous  to  Eq.  (32),  joint  estimation  of  y,  x,  z  entails  calculating 

[x;y;z\  =  (I  +  [U  V  D]tTP1N[U  V  D]) \U  V  D]tTT\F -NM)  (C-l) 

The  hard  part  of  this  is  the  matrix  inversion  (I  +  [U  V  DfSPlN\U  V  D])1 .  To  simplify  the 


notation,  let  W  =  [U  V]  and  S  =  Z  1 N ,  then  the  inversion  becomes: 

(C-2) 

In  this  fonnula,  W  is  a  tall  but  think  matrix,  and  D  and  S  are  large  diagonal  matrices.  So  the 
matrix  to  be  inverted  has  a  huge  bottom-right  comer  that  is  diagonal. 

We’ll  invert  this  matrix  by  row  transfonnations.  In  the  procedure,  we’ll  introduce  more 
letters  to  simplify  the  notation.  We  start  with 

~I  +  WTSW  WtSD 
DSW  I  +  DSD 


I  0 
0  / 


(C-3) 


(I  +  [W  D]tS[W  £>])-'  = 


I  +  WTSW  wtsd 
DSW  I  +  DSD 


Multiply  the  second  row  with  (/  +  DSD )  1 ,  and  subtract  from  the  first  row  WT SD  times  the 
new  second  row: 

"  /  +  WT  SW  -WT  SD(I  +  DSD)1  DSW  0 
(I  +  DSD)1  DSW  I 


I  -WtSD(I  +  DSD)' 


o  c i+dsd r 


(C-4) 


Let’s  look  at  the  top-left  block.  Remember  that  (I  +  A)  1  can  be  expanded  as 
/  -  A  +  A2  -  A3  ,  therefore 

I +  WTSW-  WtSD(I  +  DSD)1  DSW 

=  1  +  WTSW  -  WtSD(I  -  DSD  +  DSD2SD  -  DSD2 SD2 SD  ■  ■  ■ )DSW 
=  I  +  WT(S-  SD2S  +  SD2SD2S  -  SD2SD2SD2S  +  SD2  SD2  SD2  SD2  S  •  •  -)W  (C-5) 
=  /  +  WT  (I  +  SD2)'1  SW 
=  I +  WT  (I  +  DSD)1  SW 


Now  let  P  =  (I  +  DSD)  1 ,  then  we  have 


~I  +  WTPSW  0 

I 

-wtsdp~ 

PDSW  I 

0 

p 

(C-6) 


Let  H  =  I  +  WT PSW ,  and  multiply  the  first  row  with  H  1 : 
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/  0 

H' 

-hxwtsdp~ 

PDSW  I 

0 

p 

Subtract  from  the  second  row  PDSW  times  the  first  row: 


7  o 

H' 

-HlWTSDP 

0  / 

-PDSWH1 

p+pdswhxwtsdp 

(C-7) 


(C-8) 


Now  the  part  on  the  right  side  of  the  dotted  line  is  the  inverse  of  we  wanted  to  find.  Its 
computation  involves  the  inversion  of  two  matrices  ( I  +  DSD  and  II).  The  former  is  a 
diagonal  matrix,  and  the  second  is  small;  the  computation  of  both  is  tractable. 

We  can  go  on  to  calculate  the  estimates  of  y,  x  and  z: 


[x;y;z]  =  (/  +  [W  D]T  S[W  D])  1  [W  D]r Z (F  -  NM) 


H' 

-HlWTSDP 

-1 

~WT~ 

PDSWH~l 

P  +  PDSWHxWt  sdp 

D 

ir\F-NM )  (C-9) 


HXWT  {I  -  SDPD) 

PD  -  PDSW  1 1  'IV  (I  -  SDPD ) 


5T l(F-NM ) 


To  break  this  further  down,  we  have: 

[x-,y]  =  H  xWT(I-SDPD)L-\F -NM)  (C-10) 


z  =  PDTP1  (F  -  NM)  -  PDSW[x;  y] 
=  PDYP1  {F  -  N(M  +  W[x;  v])} 


(C-ll) 


We  can  see  that  Eq.  (C-ll)  is  exactly  the  same  as  Eq.  (33).  That  means  in  both  two-step 
estimation  and  joint  estimation,  the  same  fonnula  is  used  to  calculate  z  when  x  and  y  are 
known.  However,  in  joint  estimation,  the  matrix  D  also  comes  into  play  in  the  estimation  ofx 
and  v. 
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LIST  OF  SYMBOLS,  ABBREVIATIONS  AND 

ACRONYMS 


BHATT 

CA 

CDF 

CMN 

CMS 

CR 

DSCC 

EER 

EM 

A 0.5 

FA 

FFV 

FR 

GMM 

GUMI 

HMM 

HSCC 

IROSIS 

JFA 

KL 

L2 

LLR 

LMS 

LP 

MAP 

MD 

MFCC 

MHEC 

ML 

NIST 

PPMD 

RASTA 

RM40 

ROSSI 

SCF 


Bhattacharyya  (a  kernel  function) 
correct  accept 

cumulative  density  function 
cepstral  mean  normalization 
cepstral  mean  subtraction 
correct  reject 

delta-spectral  cepstral  coefficients 
equal  error  rate 
expectation-maximization 

precision  biased  correct  decision  rate  (an  open-set  evaluation  criterion) 
false  accept  /  false  alarm 
fundamental  frequency  variation 
false  reject 

Gaussian  mixture  model 

GMM-UBM  mean  interval  (a  kernel  function) 

hidden  Markov  model 

hannonic  structure  cepstral  coefficients 

integrated  robust  open-set  speaker  identification  system 

joint  factor  analysis 

Kullback-Leibler  (divergence,  also  a  kernel  function) 

a  kernel  function 

log-likelihood  ratio 

least  mean  squares 

linear  prediction 

maximum  a  posteriori 

minimum  divergence 

Mel-frequency  cepstral  coefficients 

mean  Hilbert  envelope  coefficients 

maximum  likelihood 

National  Institute  of  Standards  and  Technology 
PRE_PEAK8_MFCC20_GAU  S  S3  00  + 

PRE_PEAK8_DSCC20(GAUSS300)_CMN  (an  acoustic  feature) 
relative  spectral  (filtering) 
an  acoustic  feature 

robust  open-set  speaker  identification 
spectral  centroid  frequency 
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SCM 

spectral  centroid  magnitude 

SE 

speaker  error 

SID 

speaker  identification 

SNR 

signal-to-noise  ratio 

SRE 

speaker  recognition  evaluation 

SVM 

support  vector  machine 

SWCE 

sine-weighted  cepstrum  estimator  (a  type  of  multitapers) 

UBM 

universal  background  model 

VAD 

voice  activity  detection 

WBHATT 

weighted  Bhattacharyya  (a  kernel  function) 

WMVDR 

warped  minimum  variance  distortionless  response 
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